# The Mechanistic Emergence of Symbol Grounding in Language Models
**Authors**: Freda Shi, Joyce Chai (University of Michigan, University of Waterloo, Vector Institute, UNC at Chapel Hill)
## Abstract
Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.
*Authors contributed equally to this work. Advisors contributed equally to this work.*
## 1 Introduction
Symbol grounding (Harnad, 1990) refers to the problem of how abstract, discrete symbols, such as words, acquire meaning by connecting to perceptual or sensorimotor experiences. In multimodal machine learning, grounding has been leveraged as an explicit pre-training objective for vision-language models (VLMs), which connect linguistic units to the world that gives language meaning (Li et al., 2022; Ma et al., 2023). Through supervised fine-tuning with grounding signals, such as entity-phrase mappings, modern VLMs have achieved fine-grained understanding at both region (You et al., 2024; Peng et al., 2024; Wang et al., 2024) and pixel (Zhang et al., 2024b; Rasheed et al., 2024; Zhang et al., 2024a) levels.
With the rise of powerful autoregressive language models (LMs; OpenAI, 2024; Anthropic, 2024; Comanici et al., 2025, inter alia) and their VLM extensions, there is growing interest in identifying and interpreting their emergent capabilities. Recent work has shown preliminary correlational evidence that grounding may emerge in LMs (Sabet et al., 2020; Shi et al., 2021; Wu et al., 2025b) and VLMs (Cao et al., 2025; Bousselham et al., 2024; Schnaus et al., 2025) trained at scale, even when optimized solely with the simple next-token prediction objective. However, the underlying mechanisms that lead to such emergence are not well understood. To address this limitation, our work seeks to understand the emergence of symbol grounding in LMs, causally and mechanistically tracing how it arises within their internal computations.
We begin by constructing a minimal testbed, motivated by the annotations provided in the CHILDES corpora (MacWhinney, 2000), where child–caregiver interactions provide cognitively plausible contexts for studying symbol grounding alongside verbal utterances. In our framework, each word is represented in two distinct forms: one token that appears in non-verbal scene descriptions (e.g., a box in the environment) and another that appears in spoken utterances (e.g., box in dialogue). We refer to these as environmental tokens ( $\langle$ ENV $\rangle$ ) and linguistic tokens ( $\langle$ LAN $\rangle$ ), respectively. A deliberately simple word-level tokenizer assigns separate vocabulary entries to each form, ensuring that they are treated as entirely different tokens by the language model. This framework enforces a structural separation between scenes and symbols, preventing correspondences from being reduced to trivial token identity. Under this setup, we can evaluate whether a model trained from scratch is able to predict the linguistic form from its environmental counterpart.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Token Grounding and Information Aggregation
### Overview
The image illustrates a two-tiered token grounding process, mapping environmental tokens (<ENV>) to linguistic tokens (<LAN>). It emphasizes the relationship between concrete environmental concepts (e.g., "horse") and their linguistic representations, with explicit grounding via a highlighted connection.
### Components/Axes
1. **Sections**:
- **Environmental Tokens (<ENV>)**: Top row, labeled with `<ENV>` tags.
- **Linguistic Tokens (<LAN>)**: Bottom row, labeled with `<LAN>` tags.
2. **Highlighted Tokens**:
- `horse_<ENV>` (yellow background, green border).
- `the_<LAN>` (green border, connected via arrow).
3. **Arrow**:
- Green arrow labeled "Grounding (Information Aggregation)" links `horse_<ENV>` to `the_<LAN>`.
### Detailed Analysis
#### Environmental Tokens (<ENV>)
- Sequence: `<CHI> painted_<ENV> a_<ENV> picture_<ENV> of_<ENV> a_<ENV> horse_<ENV>`.
- Structure:
- `<CHI>`: Likely a context or scene identifier.
- Tokens describe a painted picture of a horse, with `<ENV>` tags indicating environmental grounding.
- `horse_<ENV>` is emphasized via color (yellow) and the grounding arrow.
#### Linguistic Tokens (<LAN>)
- Sequence: `<CHI> my_<LAN> favorite_<LAN> animal_<LAN> is_<LAN> the_<LAN> horse_<LAN>`.
- Structure:
- `<CHI>`: Matches the environmental section, suggesting shared context.
- Tokens form a sentence fragment: "my favorite animal is the horse."
- `the_<LAN>` is highlighted, mirroring the environmental `horse_<ENV>` via the arrow.
#### Grounding Mechanism
- The arrow explicitly connects `horse_<ENV>` (environmental) to `the_<LAN>` (linguistic), indicating a semantic mapping.
- Both highlighted tokens share a green border, reinforcing their linkage.
### Key Observations
1. **Repetition of `<CHI>`**: Appears in both sections, possibly denoting a shared context or identifier.
2. **Token Alignment**:
- Environmental tokens describe a scene (`painted picture of a horse`).
- Linguistic tokens form a sentence fragment referencing the same scene.
3. **Highlighting**:
- `horse_<ENV>` and `the_<LAN>` are visually linked, suggesting they represent the same entity across modalities.
4. **Tagging**:
- `<ENV>` and `<LAN>` tags differentiate token types, critical for grounding tasks.
### Interpretation
This diagram demonstrates **cross-modal grounding**, where environmental data (e.g., visual or sensory tokens) is mapped to linguistic representations. The highlighted connection between `horse_<ENV>` and `the_<LAN>` implies that the system aggregates information to associate concrete entities (e.g., a horse in a scene) with their linguistic counterparts (e.g., the word "the horse").
The repetition of `<CHI>` suggests a shared context or identifier, possibly denoting a specific instance or scenario. The grounding arrow acts as a bridge between modalities, emphasizing the importance of aligning environmental and linguistic data for tasks like natural language understanding or multimodal AI systems.
Notably, the absence of `<LAN>` tags on `horse_<LAN>` (it is grayed out) may indicate it is inferred or derived from the grounding process rather than explicitly labeled. This aligns with how grounding often involves implicit mappings rather than explicit annotations.
</details>
(a) Attention head 8 of layer 7 in GPT-CHILDES.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Environmental Token Grounding Process
### Overview
The diagram illustrates the relationship between environmental tokens (<ENV>) and linguistic tokens (<LAN>) through a grounding process called "Information Aggregation." It combines a visual representation of a llama with textual analysis and token segmentation.
### Components/Axes
1. **Visual Element**:
- Image of a llama in a fenced enclosure with desert vegetation
- Colored bounding boxes (red/yellow) on the llama's body
2. **Textual Component**:
- Question: "what would you name this ? alpaca"
- Words segmented into individual dark blue boxes
- Green box highlighting the question mark ("?")
3. **Connecting Element**:
- Green arrow from yellow box on llama to "this ?" in text
4. **Token Labels**:
- Environmental Tokens (<ENV>)
- Linguistic Tokens (<LAN>)
### Detailed Analysis
1. **Visual Annotation**:
- Red boxes: Likely represent environmental features (e.g., "desert", "fence")
- Yellow box: Highlights the llama as the primary subject
2. **Text Segmentation**:
- Each word in "what would you name this ? alpaca" is individually boxed
- Question mark ("?") emphasized with green box
3. **Token Flow**:
- Green arrow connects visual grounding (llama) to linguistic output ("alpaca")
- Suggests information flow from environmental context to language model
### Key Observations
1. The grounding process transforms visual input into structured linguistic tokens
2. The question mark acts as a critical junction between perception and language
3. Color coding differentiates token types:
- Red/Yellow: Environmental features
- Dark Blue: Linguistic tokens
- Green: Connection/grounding element
4. "alpaca" appears as the final output token, disconnected from the question structure
### Interpretation
This diagram demonstrates a multimodal grounding process where:
1. Environmental context (visual scene) is analyzed through tokenized features
2. The system generates a question ("what would you name this ?") to bridge perception and language
3. The question mark serves as the critical interface between visual and linguistic processing
4. The final output ("alpaca") emerges from aggregating environmental information through the grounding mechanism
The color-coded tokenization suggests a structured approach to:
- Spatial analysis (ENV tokens)
- Semantic decomposition (LAN tokens)
- Contextual integration (green arrow connection)
The absence of numerical values indicates this is a conceptual diagram rather than data visualization, focusing on process flow rather than quantitative analysis.
</details>
(b) Attention head 7 of layer 20 in LLaVA-1.5-7B.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Heatmap Visualization: Attention and Saliency Analysis
### Overview
The image presents a dual-part visualization:
1. **Left Grid**: A 12x12 matrix labeled "layer" (vertical axis) and "head" (horizontal axis), with a highlighted cell at layer 7, head 8.
2. **Right Heatmap**: A word-based saliency map with a color scale from 0.0 (dark purple) to 0.3 (yellow), highlighting the word "horse" in yellow.
The visualization connects the left grid to the right heatmap via yellow arrows, suggesting a relationship between attention patterns and word saliency.
---
### Components/Axes
#### Left Grid (Attention Matrix)
- **Vertical Axis (Layer)**: Labeled "layer" with values 1–12.
- **Horizontal Axis (Head)**: Labeled "head" with values 1–12.
- **Highlighted Cell**: Layer 7, Head 8 (marked with a yellow square).
- **Color Scale**: Not explicitly labeled, but the highlighted cell is yellow, implying higher attention.
#### Right Heatmap (Saliency Map)
- **Vertical Axis (Words)**: Contains phrases like `<CHI> painted a picture of a horse <CHI> my favorite animal is the`.
- **Horizontal Axis (Words)**: Contains phrases like `<ENV> <LAN>`.
- **Color Scale**: Labeled "saliency" with values 0.0 (dark purple) to 0.3 (yellow).
- **Highlighted Cell**: The word "horse" (yellow).
#### Legend
- **Color Bar**: Positioned to the right of the heatmap, transitioning from dark purple (0.0) to yellow (0.3).
---
### Detailed Analysis
#### Left Grid (Attention Matrix)
- **Structure**: 12x12 grid with uniform dark purple cells except for the highlighted cell at (7, 8).
- **Trend**: No discernible pattern in the grid; the highlighted cell is an outlier.
- **Uncertainty**: No numerical values provided for other cells; only the highlighted cell's saliency is implied.
#### Right Heatmap (Saliency Map)
- **Structure**: 12x12 grid with varying shades of purple and yellow.
- **Key Data Points**:
- The word "horse" is the brightest (yellow), indicating the highest saliency.
- Other words (e.g., "painted," "picture," "favorite") show lower saliency (darker purple).
- **Trend**: The saliency decreases from "horse" outward, with no other words reaching the yellow threshold.
---
### Key Observations
1. **Outlier in Left Grid**: The highlighted cell at layer 7, head 8 is the only cell with a distinct color, suggesting it is the most active attention head in that layer.
2. **Saliency Focus**: The word "horse" dominates the right heatmap, indicating it is the most salient term in the text.
3. **Connection**: The yellow arrows linking the left grid to the right heatmap imply that the attention in layer 7, head 8 is directly tied to the saliency of "horse."
---
### Interpretation
- **Attention-Saliency Relationship**: The visualization suggests that the model's attention in layer 7, head 8 is concentrated on the word "horse," which is the most salient term in the text. This could indicate that the model prioritizes this word for tasks like classification or generation.
- **Model Behavior**: The lack of other highlighted cells in the left grid implies that this specific head-layer combination is uniquely responsible for processing the salient word.
- **Implications**: This could reflect how the model encodes specific concepts (e.g., "horse") in its internal representations, with certain attention heads specializing in particular linguistic features.
---
**Note**: The image does not contain numerical values for non-highlighted cells, limiting quantitative analysis. The interpretation relies on visual cues and the explicit connection between the grid and heatmap.
</details>
(c) Left: saliency over tokens of each head in each layer for the prompt $\langle$ CHI $\rangle$ $\textit{painted}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{a}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{picture}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{of}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{a}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{horse}_{\texttt{$\langle$ENV$\rangle$}}$ $\langle$ CHI $\rangle$ $\textit{my}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{favorite}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{animal}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{is}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{the}_{\texttt{$\langle$LAN$\rangle$}}$ . Right: among all, only one of them (head 8 of layer 7) is identified as an aggregate head, where information flows from $\textit{horse}_{\texttt{$\langle$ENV$\rangle$}}$ to the current position, encouraging the model to predict $\textit{horse}_{\texttt{$\langle$LAN$\rangle$}}$ as the next token.
Figure 1: Illustration of the symbol grounding mechanism through information aggregation. Lighter colors denote more salient attention, quantified by saliency scores, i.e., gradient $\times$ attention contributions to the loss (Wang et al., 2023). When predicting the next token, aggregate heads (Bick et al., 2025) emerge to exclusively link environmental tokens (visual or situational context; $\langle$ ENV $\rangle$ ) to linguistic tokens (words in text; $\langle$ LAN $\rangle$ ). These heads provide a mechanistic pathway for symbol grounding by mapping external environmental evidence into its linguistic form.
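As a toy illustration of the saliency score in the caption (gradient $\times$ attention, following Wang et al., 2023), saliency for each attended token is the elementwise product of an attention weight and the gradient of the loss with respect to it. The numbers below are made up for illustration; real values come from a forward and backward pass of the model:

```python
# Saliency per attention entry: |A_ij * dL/dA_ij|. The attention weights
# and gradients below are made-up stand-ins, not values from the paper.
attn = {"horse_<ENV>": 0.62, "picture_<ENV>": 0.10, "my_<LAN>": 0.05}
grad = {"horse_<ENV>": 0.48, "picture_<ENV>": 0.02, "my_<LAN>": 0.01}

saliency = {tok: abs(a * grad[tok]) for tok, a in attn.items()}
top = max(saliency, key=saliency.get)
print(top)  # the environmental ground dominates: horse_<ENV>
```

In Figure 1, exactly this kind of dominant entry singles out head 8 of layer 7 as the aggregate head.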
We quantify the level of grounding using surprisal: specifically, we compare how easily the model predicts a linguistic token ( $\langle$ LAN $\rangle$ ) when its matching environmental token ( $\langle$ ENV $\rangle$ ) is present versus when unrelated cues are given instead. A lower surprisal in the former condition indicates that the model has learned to align environmental grounds with linguistic forms. We find that LMs do learn to ground: the presence of environmental tokens consistently reduces surprisal for their linguistic counterparts, in a way that simple co-occurrence statistics cannot fully explain. To study the underlying mechanisms, we apply saliency analysis (Wang et al., 2023) and the tuned lens (Belrose et al., 2023), which converge on the result that grounding relations are concentrated in the middle layers of the network. Further analysis of attention heads reveals patterns consistent with the aggregate mechanism (Bick et al., 2025), where attention heads support the prediction of linguistic forms by retrieving their environmental grounds in the context.
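Concretely, the grounding metric compares two surprisal values for the same linguistic token. A minimal sketch, with hypothetical probabilities standing in for a trained model's predictions:

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 p(token | context)."""
    return -math.log2(prob)

# Hypothetical next-token probabilities for book_<LAN> under matched
# vs. mismatched environmental context (illustrative numbers only).
p_matched = 0.21     # context contains book_<ENV>
p_mismatched = 0.03  # context contains toy_<ENV> instead

delta = surprisal(p_mismatched) - surprisal(p_matched)
# A positive delta means the environmental ground eases prediction of
# its linguistic counterpart, i.e., behavioral evidence of grounding.
print(f"grounding effect: {delta:.2f} bits")  # prints "grounding effect: 2.81 bits"
```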
Finally, we demonstrate that these findings generalize beyond the minimal CHILDES data and Transformer models. They appear in a multimodal setting with the Visual Dialog dataset (Das et al., 2017), and in state-space models (SSMs) such as Mamba-2 (Dao & Gu, 2024). In contrast, we do not observe grounding in unidirectional LSTMs, consistent with their sequential state compression and lack of content-addressable retrieval. Taken together, our results show that symbol grounding can mechanistically emerge in autoregressive LMs, while also delineating the architectural conditions under which it can arise.
## 2 Related Work
### 2.1 Language Grounding
Referential grounding has long been framed as the lexicon acquisition problem: how words map to referents in the world (Harnad, 1990; Gleitman & Landau, 1994; Clark, 1995). Early work focused on word-to-symbol mappings, designing learning mechanisms that simulate children's lexical acquisition and explain psycholinguistic phenomena (Siskind, 1996; Regier, 2005; Goodman et al., 2007; Fazly et al., 2010). Subsequent studies incorporated visual grounding, first by aligning words with object categories (Roy & Pentland, 2002; Yu, 2005; Xu & Tenenbaum, 2007; Yu & Ballard, 2007; Yu & Siskind, 2013), and later by mapping words to richer visual features (Qu & Chai, 2010; Mao et al., 2019; 2021; Pratt et al., 2020). More recently, large-scale VLMs trained with paired text–image supervision have advanced grounding to finer levels of granularity, achieving region-level (Li et al., 2022; Ma et al., 2023; Chen et al., 2023; You et al., 2024; Wang et al., 2024) and pixel-level (Xia et al., 2024; Rasheed et al., 2024; Zhang et al., 2024b) grounding, with strong performance on referring expression comprehension (Chen et al., 2024a).
Recent work suggests that grounding emerges as a property of VLMs trained without explicit supervision, with evidence drawn from attention-based spatial localization (Cao et al., 2025; Bousselham et al., 2024) and cross-modal geometric correspondences (Schnaus et al., 2025). However, all prior work focused exclusively on static final-stage models, overlooking the training trajectory, a crucial aspect for understanding when and how grounding emerges. In addition, existing work has framed grounding through correlations between visual and textual signals, diverging from the definition by Harnad (1990), which emphasizes causal links from symbols to meanings. To address these issues, we systematically examine learning dynamics throughout the training process, applying causal interventions to probe model internals and introducing control groups to enable rigorous comparison.
### 2.2 Emergent Capabilities and Learning Dynamics of LMs
A central debate concerns whether larger language models exhibit genuinely new behaviors: Wei et al. (2022) highlight abrupt improvements in tasks, whereas later studies argue such effects are artifacts of thresholds or in-context learning dynamics (Schaeffer et al., 2023; Lu et al., 2024). Beyond end performance, developmental analyses show that models acquire linguistic abilities in systematic though heterogeneous orders, with variability across runs and checkpoints (Sellam et al., 2021; Blevins et al., 2022; Biderman et al., 2023; Xia et al., 2023; van der Wal et al., 2025). Psychology-inspired perspectives further emphasize controlled experimentation to assess these behaviors (Hagendorff, 2023), and comparative studies reveal both parallels and divergences between machine and human language learning (Chang & Bergen, 2022; Evanson et al., 2023; Chang et al., 2024; Ma et al., 2025). At a finer granularity, hidden-loss analyses identify phase-like transitions (Kangaslahti et al., 2025), while distributional studies attribute emergence to stochastic differences across training seeds (Zhao et al., 2024). Taken together, these studies suggest that emergent abilities are not sharp discontinuities but probabilistic outcomes of developmental learning dynamics. Following this line of work, we present a probability- and model internals-based analysis of how symbol grounding emerges during language model training.
### 2.3 Mechanistic Interpretability of LMs
Mechanistic interpretability has largely focused on attention heads in Transformers (Elhage et al., 2021; Olsson et al., 2022; Meng et al., 2022; Bietti et al., 2023; Lieberum et al., 2023; Wu et al., 2025a). A central line of work established that induction heads emerge to support in-context learning (ICL; Elhage et al., 2021; Olsson et al., 2022), with follow-up studies tracing their training dynamics (Bietti et al., 2023) and mapping factual recall circuits (Meng et al., 2022). At larger scales, Lieberum et al. (2023) identified specialized content-gatherer and correct-letter heads, and Wu et al. (2025a) showed that a sparse set of retrieval heads is critical for reasoning and long-context performance. Relatedly, Wang et al. (2023) demonstrated that label words in demonstrations act as anchors: early layers gather semantic information into these tokens, which later guide prediction. Based on these insights, Bick et al. (2025) proposed that retrieval is implemented through a coordinated gather-and-aggregate (G&A) mechanism: some heads collect content from relevant tokens, while others aggregate it at the prediction position. Other studies extended this line of work by analyzing failure modes and training dynamics (Wiegreffe et al., 2025) and contrasting retrieval mechanisms in Transformers and SSMs (Arora et al., 2025). Whereas prior analyses typically investigate ICL with repeated syntactic or symbolic formats, our setup requires referential alignment between linguistic forms and their environmental contexts, providing a complementary testbed for naturalistic language grounding.
## 3 Method
Table 1: Training and test examples across datasets with target word book. The training examples combine environmental tokens ( $\langle$ ENV $\rangle$ ; shaded) with linguistic tokens ( $\langle$ LAN $\rangle$ ). Test examples are constructed with either matched (book) or mismatched (toy) environmental contexts, paired with corresponding linguistic prompts. Note that in child-directed speech and caption-grounded dialogue, book ${}_{\texttt{$\langle$ENV$\rangle$}}$ and book ${}_{\texttt{$\langle$LAN$\rangle$}}$ are two distinct tokens received by LMs.
| Child-Directed Speech | $\langle$ CHI $\rangle$ takes book from mother | $\langle$ CHI $\rangle$ what's that $\langle$ MOT $\rangle$ a book in it … | $\langle$ CHI $\rangle$ asked for a new book | $\langle$ CHI $\rangle$ asked for a new toy | $\langle$ CHI $\rangle$ I love this |
| --- | --- | --- | --- | --- | --- |
| Caption-Grounded Dialogue | a dog appears to be reading a book with a full bookshelf behind | $\langle$ Q $\rangle$ can you tell what book it's reading $\langle$ A $\rangle$ the marriage of true minds by stephen evans | this is a book | this is a toy | $\langle$ Q $\rangle$ can you name this object $\langle$ A $\rangle$ |
| Image-Grounded Dialogue |
<details>
<summary>figs/data/book-train.jpg Details</summary>

### Visual Description
## Photograph: Dog with Book in Library Setting
### Overview
A black-and-white speckled dog sits on a polished wooden floor, facing a yellow book titled *"The Marriage of True Minds"* by Stephen Evans. The dog's head is tilted slightly toward the book, with its ears perked. Behind the dog, bookshelves filled with books are visible, including titles related to animal rights (e.g., *"Animal Rights / The Issues," "Bears," "Wild Animals"*).
### Components/Axes
- **Foreground**:
- **Dog**: Black-and-white speckled coat, pink nose, dark eyes, alert posture.
- **Book**: Yellow cover with bold black text: *"THE MARRIAGE OF TRUE MINDS"* (title), *"Stephen Evans"* (author). A small illustration of a red candle is visible near the bottom of the cover.
- **Background**:
- **Bookshelves**: Light-colored wooden shelves with books arranged vertically. Labels on shelves include:
- *"ANIMAL RIGHTS / The Issues / The Movement"* (top shelf).
- *"Wild Animals," "Bears," "Sanders," "Almost Human"* (middle shelves).
### Detailed Analysis
- **Text Extraction**:
- Book title: *"The Marriage of True Minds"* (Stephen Evans).
- Shelf labels: *"ANIMAL RIGHTS / The Issues / The Movement," "Wild Animals," "Bears," "Sanders," "Almost Human."*
- No other legible text or numerical data present.
### Key Observations
- The dog's positioning suggests a staged, anthropomorphic scene, as if the dog is "reading" the book.
- The book's theme (love/intellect) contrasts with the animal rights titles in the background, potentially implying a narrative about human-animal relationships.
- No numerical data, charts, or diagrams are present.
### Interpretation
The image likely serves as a creative or symbolic representation, juxtaposing the dog's curiosity with the book's philosophical themes. The animal rights titles in the background may hint at the owner's interests or the dog's role as a companion in a household engaged with ethical or intellectual pursuits. The absence of explicit data suggests the image prioritizes visual storytelling over factual or analytical content.
**Note**: No numerical or structured data (e.g., charts, tables) is present in the image. All textual elements are transcribed as described.
</details>
| $\langle$ Q $\rangle$ can you tell what book it's reading $\langle$ A $\rangle$ the marriage of true minds by stephen evans |
<details>
<summary>figs/data/book-test.jpg Details</summary>

### Visual Description
## Photograph: Wooden Bookshelf with Decorative Items
### Overview
The image depicts a large, dark-stained wooden bookshelf against a yellow wall. The shelves are densely packed with books, framed photographs, and small decorative objects. A model ship is placed atop the bookshelf, and the arrangement suggests a personal library or study space. No readable text is visible on the books, photos, or decorative items.
### Components/Axes
- **Bookshelf Structure**:
- Three vertical sections with upper cabinets (no visible handles) and lower open shelves.
- Lower section includes drawers with dark metal handles.
- **Decorative Items**:
- Framed photographs (small, rectangular, placed on shelves and drawers).
- Stacked books (various sizes, some with visible spines but unreadable titles).
- Small figurines (e.g., a toy car, a small animal figurine).
- A model ship (wooden, multi-masted, positioned on the top shelf).
- **Wall**: Mustard-yellow color, plain with no additional decor.
### Detailed Analysis
- **Books**:
- Spines visible but titles indiscernible due to resolution.
- Colors range from red, blue, green, to black, suggesting diverse genres or authors.
- **Photographs**:
- Framed in simple black or metallic frames.
- Placed on lower shelves and drawers, suggesting personal significance.
- **Decorative Objects**:
- Toy car (blue and yellow, placed on the bottom shelf).
- Small animal figurine (white, near the center of the bookshelf).
- **Model Ship**:
- Positioned on the top shelf, centered.
- No visible text or markings on the ship.
### Key Observations
1. **No Readable Text**: No labels, titles, or inscriptions are legible in the image.
2. **Symmetry and Organization**: Books are arranged vertically, with decorative items interspersed for aesthetic balance.
3. **Color Contrast**: Dark wood of the bookshelf contrasts with the yellow wall and colorful books.
4. **Personalization**: Framed photos and figurines indicate the space is curated for personal use.
### Interpretation
The bookshelf serves as a functional and decorative element, blending literature with personal memorabilia. The absence of readable text suggests the focus is on visual appeal rather than cataloging. The model ship and framed photos imply a narrative of travel, family, or hobbies, while the toy car adds a playful touch. The yellow wall enhances the warmth of the wooden tones, creating a cohesive, inviting atmosphere.
**Note**: No textual data (e.g., titles, labels) is extractable from the image. All descriptions are based on visible spatial and aesthetic elements.
</details>
|
<details>
<summary>figs/data/book-test-control.jpg Details</summary>

### Visual Description
## Photograph: Wooden Cabinet with Glass Display Sections
### Overview
The image depicts a large, dark-stained wooden cabinet with multiple glass-fronted sections. Each section contains indistinct items, likely collectibles or decorative objects. A small model ship is placed atop the cabinet. The background wall is painted a warm yellow, and the cabinet occupies the majority of the frame.
### Components/Axes
- **Cabinet Structure**:
- Dark wood finish with recessed paneling.
- Glass-fronted compartments (at least three visible).
- Metal hinges and handles on lower sections.
- **Model Ship**:
- Positioned centrally on the cabinet's top surface.
- Wooden construction with three masts and rigging.
- **Items in Display Sections**:
- Blurred contents visible through glass (no discernible labels or text).
- Objects include figurines, books, and possibly mechanical or glassware items.
### Detailed Analysis
- **Cabinet Design**:
- Uniform dark brown wood with lighter wood inlays.
- Glass panels are rectangular, framed by dark metal.
- Lower sections have horizontal drawers or doors with brass handles.
- **Model Ship**:
- Scale model, approximately 1:100 ratio.
- Rigging details suggest a historical sailing vessel (e.g., 18th/19th century).
- **Display Items**:
- No visible text, labels, or branding.
- Items appear curated, suggesting personal or thematic significance.
### Key Observations
1. **No Textual Elements**: The image contains no legible text, labels, or legends.
2. **Uniformity of Sections**: All glass-fronted compartments share identical design and spacing.
3. **Decorative Focus**: The cabinet and model ship emphasize aesthetic display over functional storage.
### Interpretation
The cabinet serves as a curated display unit, likely for personal or historical artifacts. The absence of text suggests the items are valued for their visual or sentimental appeal rather than informational content. The model ship atop the cabinet may indicate a nautical theme or personal interest in maritime history. The blurred contents of the glass sections imply the items are intentionally obscured, possibly to maintain privacy or focus attention on the cabinet's craftsmanship.
No data trends, numerical values, or structured datasets are present. The image prioritizes visual composition over technical or analytical content.
</details>
| what do we have here? |
### 3.1 Dataset and Tokenization
To capture the emergent grounding from multimodal interactions, we design a minimal testbed with a custom word-level tokenizer, in which every lexical item is represented in two corresponding forms: one token that appears in non-verbal descriptions (e.g., a book in the scene description) and another that appears in utterances (e.g., book in speech). We refer to these as environmental ( $\langle$ ENV $\rangle$ ) and linguistic ( $\langle$ LAN $\rangle$ ) tokens, respectively. For instance, book ${}_{\texttt{$\langle$ENV$\rangle$}}$ and book ${}_{\texttt{$\langle$LAN$\rangle$}}$ are treated as distinct tokens with separate integer indices; that is, the tokenization provides no explicit signal that these tokens are related, so any correspondence between them must be learned during training rather than inherited from surface form. We instantiate this framework in three datasets, ranging from child-directed speech transcripts to image-based dialogue.
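A minimal sketch of such a tokenizer (our own illustration, not the paper's released code): a form suffix makes the two variants of each word entirely distinct vocabulary entries.

```python
class DualFormTokenizer:
    """Word-level tokenizer that gives the environmental and linguistic
    forms of the same word unrelated integer indices (illustrative sketch)."""

    def __init__(self):
        self.vocab = {}

    def _id(self, token: str) -> int:
        # Assign the next free integer index to unseen tokens.
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab)
        return self.vocab[token]

    def encode(self, words, form):
        # form is "ENV" for scene descriptions, "LAN" for utterances;
        # the suffix makes book_<ENV> and book_<LAN> distinct tokens.
        return [self._id(f"{w}_<{form}>") for w in words]

tok = DualFormTokenizer()
env_ids = tok.encode(["a", "book"], "ENV")
lan_ids = tok.encode(["a", "book"], "LAN")
assert set(env_ids).isdisjoint(lan_ids)  # no shared token identity
```

Because the two index sets are disjoint, any ENV-LAN correspondence the model exhibits must be learned from co-occurrence during training.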
Child-directed speech. The Child Language Data Exchange System (CHILDES; MacWhinney, 2000) provides transcripts of speech enriched with environmental annotations (see the CHAT manual for data usage: https://talkbank.org/0info/manuals/CHAT.pdf). We use the spoken utterances as the linguistic tokens ( $\langle$ LAN $\rangle$ ) and the environmental descriptions as the environmental tokens ( $\langle$ ENV $\rangle$ ). The environmental context is drawn from three annotation types:
- Local events: simple events, pauses, long events, or remarks interleaved with the transcripts.
- Action tiers: actions performed by the speaker or listener (e.g., %act: runs to toy box). These also include cases where an action replaces speech (e.g., 0 [% kicks the ball]).
- Situational tiers: situational information tied to utterances or to larger contexts (e.g., %sit: dog is barking).
Caption-grounded dialogue. The Visual Dialog dataset (Das et al., 2017) pairs MSCOCO images (Lin et al., 2014) with sequential multi-turn question-answering dialogues that exchange information about each image. Our setup uses MSCOCO captions as the environmental tokens ( $\langle$ ENV $\rangle$ ) and the dialogue turns as the linguistic tokens ( $\langle$ LAN $\rangle$ ). In this pseudo cross-modal setting, textual descriptions of visual scenes ground natural conversational interaction. Compared to CHILDES, this setup introduces richer semantics and longer utterances while still using text-based inputs for both token types, thereby offering a stepping stone toward grounding in fully visual contexts.
Image-grounded dialogue. To move beyond textual proxies, we consider an image-grounded dialogue setup, using the same dataset as the caption-grounded dialogue setting. Here, a frozen vision transformer (ViT; Dosovitskiy et al., 2020) directly tokenizes each RGB image into patch embeddings, with each embedding treated as an $\langle$ ENV $\rangle$ token, analogously to the visual tokens in modern VLMs. We use DINOv2 (Oquab et al., 2024) as our ViT tokenizer, as it is trained purely on vision data without auxiliary text supervision (in contrast to models like CLIP; Radford et al., 2021), thereby ensuring that environmental tokens capture only visual information. The linguistic tokens ( $\langle$ LAN $\rangle$ ) remain unchanged from the caption-grounded dialogue setting, resulting in a realistic multimodal interaction where conversational utterances are grounded directly in visual input.
### 3.2 Evaluation Protocol
We assess symbol grounding with a contrastive test that asks whether a model assigns a higher probability to the correct linguistic token when the matching environmental token is in context, following the idea of priming in psychology. This evaluation applies uniformly across datasets (Table 1): in CHILDES and caption-grounded dialogue, environmental priming comes from descriptive contexts; in image-grounded dialogue, from ViT-derived visual tokens. We compare the following conditions:
- Match (experimental condition): The context contains the corresponding $\langle$ ENV $\rangle$ token for the target word, and the model is expected to predict its $\langle$ LAN $\rangle$ counterpart.
- Mismatch (control condition): The context is replaced with a different $\langle$ ENV $\rangle$ token. The model remains tasked with predicting the same $\langle$ LAN $\rangle$ token; however, in the absence of corresponding environmental cues, its performance is expected to be no better than chance.
For example (first row in Table 1), when evaluating the word $\textit{book}_{\texttt{$\langle$LAN$\rangle$}}$ , the input context is
$$
\displaystyle\vskip-2.0pt\langle\textit{CHI}\rangle\textit{ asked}_{\texttt{$\langle$ENV$\rangle$}}\textit{ for}_{\texttt{$\langle$ENV$\rangle$}}\textit{ a}_{\texttt{$\langle$ENV$\rangle$}}\textit{ new}_{\texttt{$\langle$ENV$\rangle$}}\textit{ book}_{\texttt{$\langle$ENV$\rangle$}}\textit{ }\langle\textit{CHI}\rangle\textit{ I}_{\texttt{$\langle$LAN$\rangle$}}\textit{ love}_{\texttt{$\langle$LAN$\rangle$}}\textit{ this}_{\texttt{$\langle$LAN$\rangle$}}\textit{ }\underline{\hskip 30.00005pt},\vskip-2.0pt \tag{1}
$$
where the model is expected to predict $\textit{book}_{\texttt{$\langle$LAN$\rangle$}}$ for the blank, and the role token $\langle$ CHI $\rangle$ indicates that the involved speaker or actor's role is a child. In the control (mismatch) condition, the environmental token book ${}_{\texttt{$\langle$ENV$\rangle$}}$ is replaced by another valid noun such as toy ${}_{\texttt{$\langle$ENV$\rangle$}}$ .
Context templates. For a target word $v$ with linguistic token $v_{\texttt{$\langle$LAN$\rangle$}}$ and environmental token $v_{\texttt{$\langle$ENV$\rangle$}}$ , we denote by $\overline{C}_{v}$ a set of context templates for $v$ . For example, when $v=\textit{book}$ , a $\overline{c}\in\overline{C}_{v}$ can be
$$
\displaystyle\vskip-2.0pt\langle\textit{CHI}\rangle\textit{ asked}_{\texttt{$\langle$ENV$\rangle$}}\textit{ for}_{\texttt{$\langle$ENV$\rangle$}}\textit{ a}_{\texttt{$\langle$ENV$\rangle$}}\textit{ new}_{\texttt{$\langle$ENV$\rangle$}}\textit{ }\texttt{[FILLER]}\textit{ }\langle\textit{CHI}\rangle\textit{ I}_{\texttt{$\langle$LAN$\rangle$}}\textit{ love}_{\texttt{$\langle$LAN$\rangle$}}\underline{\hskip 30.00005pt},\vskip-2.0pt \tag{2}
$$
where [FILLER] is to be replaced with an environmental token, and the blank indicates the expected prediction as in Eq. (1). In the match condition, the context $\overline{c}(v)$ is constructed by replacing [FILLER] with $v_{\texttt{$\langle$ENV$\rangle$}}$ in $\overline{c}$ . In the mismatch condition, the context $\overline{c}(u)$ uses $u_{\texttt{$\langle$ENV$\rangle$}}(u\neq v)$ as the filler, while the prediction target remains $v_{\texttt{$\langle$LAN$\rangle$}}$ .
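The template-filling procedure above can be sketched as follows (illustrative; readable token strings stand in for the integer token ids):

```python
# Sketch of match/mismatch context construction: the [FILLER] slot of a
# template is replaced by the target word's <ENV> token (match) or by
# another word's <ENV> token (mismatch); the prediction target is the
# target word's <LAN> token in both conditions.

def instantiate(template, filler_word):
    return [f"{filler_word}_ENV" if t == "[FILLER]" else t for t in template]

template = ["<CHI>", "asked_ENV", "for_ENV", "a_ENV", "new_ENV", "[FILLER]",
            "<CHI>", "I_LAN", "love_LAN", "this_LAN"]

match_ctx = instantiate(template, "book")    # match: context has book_ENV
mismatch_ctx = instantiate(template, "toy")  # mismatch: book_ENV -> toy_ENV
target = "book_LAN"                          # identical target in both
```

The two conditions thus differ in exactly one context token, isolating the effect of the environmental ground on the prediction.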
For the choices of $v$ and $u$ , we construct the vocabulary $V$ with 100 nouns from the MacArthur–Bates Communicative Development Inventories (Fenson et al., 2006) that occur frequently in our corpus. Each word serves once as the target, with the remaining $M=99$ used to construct mismatched conditions. For each word, we create $N=10$ context templates, which contain both $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens. Details of the vocabulary and context template construction can be found in Appendix A.
Grounding information gain. Following prior work, we evaluate how well an LM learns a word using the mean surprisal over instances. The surprisal of a word $w$ given a context $c$ is defined as $s_{\boldsymbol{\theta}}(w\mid c)=-\log P_{\boldsymbol{\theta}}(w\mid c),$ where $P_{\boldsymbol{\theta}}(w\mid c)$ denotes the probability, under an LM parameterized by ${\boldsymbol{\theta}}$ , that the next word is $w$ conditioned on the context $c$ . Here, $s_{\boldsymbol{\theta}}(w\mid c)$ quantifies the unexpectedness of predicting $w$ , or the pointwise information carried by $w$ conditioned on the context.
The grounding information gain $G_{\boldsymbol{\theta}}(v)$ for $v$ is defined as
$$
G_{\boldsymbol{\theta}}(v)=\frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{M}\sum_{u\neq v}\Big[s_{\boldsymbol{\theta}}\left(v_{\texttt{$\langle$LAN$\rangle$}}\mid\overline{c}_{n}\left(u_{\texttt{$\langle$ENV$\rangle$}}\right)\right)-s_{\boldsymbol{\theta}}\left(v_{\texttt{$\langle$LAN$\rangle$}}\mid\overline{c}_{n}\left(v_{\texttt{$\langle$ENV$\rangle$}}\right)\right)\Big]\right).
$$
This is a sample-based estimate of the expected log-likelihood ratio between the match and mismatch conditions:
$$
G_{\boldsymbol{\theta}}(v)=\mathbb{E}_{c,u}\left[\log\frac{P_{\boldsymbol{\theta}}(v_{\texttt{$\langle$LAN$\rangle$}}\mid c,v_{\texttt{$\langle$ENV$\rangle$}})}{P_{\boldsymbol{\theta}}(v_{\texttt{$\langle$LAN$\rangle$}}\mid c,u_{\texttt{$\langle$ENV$\rangle$}})}\right],
$$
which quantifies how much more information the matched ground provides for predicting the linguistic form, compared to a mismatched one. A positive $G_{\boldsymbol{\theta}}(v)$ indicates that the matched environmental token increases the predictability of its linguistic form. We report $G_{\boldsymbol{\theta}}=\frac{1}{|V|}\sum_{v\in V}G_{\boldsymbol{\theta}}(v)$ , and track $G_{{\boldsymbol{\theta}}^{(t)}}$ across training steps $t$ to analyze how grounding emerges over time.
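A sample-based computation of $G_{\boldsymbol{\theta}}(v)$ can be sketched as below, assuming access to a next-token probability function `lm_prob` (a hypothetical stand-in for the trained LM; the toy model at the end is fabricated for illustration only):

```python
import math

def surprisal(p):
    return -math.log(p)

def grounding_gain(v, vocab, templates, lm_prob):
    """Mean over templates and mismatched words u != v of
    s(v_LAN | c(u_ENV)) - s(v_LAN | c(v_ENV))."""
    diffs = []
    for c in templates:  # each template c maps a filler token to a context
        s_match = surprisal(lm_prob(f"{v}_LAN", c(f"{v}_ENV")))
        for u in vocab:
            if u != v:
                diffs.append(
                    surprisal(lm_prob(f"{v}_LAN", c(f"{u}_ENV"))) - s_match)
    return sum(diffs) / len(diffs)

# Toy LM: predicting book_LAN is 5x likelier when book_ENV is in context,
# so the gain should be log 5 > 0.
toy_lm = lambda target, ctx: 0.5 if "book_ENV" in ctx else 0.1
templates = [lambda f: f"<CHI> asked for a new {f} <CHI> I love this"]
gain = grounding_gain("book", ["book", "toy", "ball"], templates, toy_lm)
```

A positive gain, as here, means the matched environmental token makes the linguistic form more predictable than any mismatched one.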
### 3.3 Model Training
We train LMs from random initialization, ensuring that no prior linguistic knowledge influences the results. Our training uses the standard causal language modeling objective, as in most generative LMs. To account for variability, we repeat all experiments with 5 random seeds, randomizing both model initialization and corpus shuffle order. Our primary architecture is a Transformer (Vaswani et al., 2017) in the style of GPT-2 (Radford et al., 2019) with 18, 12, or 4 layers, all with residual connections. We extend the experiments to 4-layer unidirectional LSTMs (Hochreiter & Schmidhuber, 1997) with no residual connections, as well as 12- and 4-layer state-space models (specifically, Mamba-2; Dao & Gu, 2024). For a fair comparison with LSTMs, the 4-layer Mamba-2 models do not involve residual connections, whereas the 12-layer ones do. For multimodal settings, while standard LLaVA (Liu et al., 2023) uses a two-layer perceptron to project ViT embeddings into the language model, we bypass this projection and directly feed the DINOv2 representations into the LM. We obtain the developmental trajectory of each model by saving checkpoints at various training steps, sampling more heavily from earlier steps, following Chang & Bergen (2022).
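A checkpoint schedule of the kind described, denser early in training, can be sketched as roughly log-spaced steps (the exact steps below are illustrative, not those used in our experiments):

```python
# Illustrative checkpoint schedule: roughly log-spaced steps up to the
# final training step, so early training dynamics are sampled far more
# densely than late ones.

def checkpoint_steps(total_steps, per_decade=5):
    steps, base = set(), 1
    while base < total_steps:
        for k in range(per_decade):
            steps.add(min(total_steps, round(base * 10 ** (k / per_decade))))
        base *= 10
    steps.add(total_steps)
    return sorted(steps)

schedule = checkpoint_steps(20_000)
# Many checkpoints land below 1,000 steps; only a handful above 10,000.
```

Such a schedule keeps the number of saved checkpoints small while still resolving the rapid changes at the start of training.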
## 4 Behavioral Evidence
<details>
<summary>x4.png Details</summary>

Line graph: average surprisal (y-axis) over training steps (x-axis, 0 to 20,000) for the match (blue) and mismatch (orange) conditions.
</details>
(a) 12-layer Transformer.
<details>
<summary>x5.png Details</summary>

Line graph: average surprisal (y-axis) over training steps (x-axis, 0 to 20,000) for the match (blue) and mismatch (orange) conditions.
</details>
(b) 4-layer Transformer.
<details>
<summary>x6.png Details</summary>

Line graph: average surprisal (y-axis) over training steps (x-axis, 0 to 20,000) for the match (blue) and mismatch (orange) conditions, with shaded variability bands.
</details>
(c) 4-layer Mamba 2.
<details>
<summary>x7.png Details</summary>

Line graph: average surprisal (y-axis) over training steps (x-axis, 0 to 20,000) for the match (blue) and mismatch (orange) conditions.
</details>
(d) 4-layer LSTM.
Figure 2: Average surprisal of the experimental and control conditions over training steps.
<details>
<summary>x8.png Details</summary>

Dual-axis line graph over training steps (0 to 20,000): grounding information gain (blue, right axis, 0 to 6) and the R² value of the co-occurrence regression (orange, left axis, 0 to 0.8).
</details>
(a) 12-layer Transformer.
<details>
<summary>x9.png Details</summary>

Dual-axis line graph over training steps (0 to 20,000): grounding information gain (blue, right axis, 0 to 6) and the R² value of the co-occurrence regression (orange, left axis, 0 to 0.8).
</details>
(b) 4-layer Transformer.
<details>
<summary>x10.png Details</summary>

Dual-axis line graph over training steps (0 to 20,000): grounding information gain (blue, right axis, 0 to 6) and the R² value of the co-occurrence regression (orange, left axis, 0 to 0.8).
</details>
(c) 4-layer Mamba 2.
<details>
<summary>x11.png Details</summary>

Dual-axis line graph over training steps (0 to 20,000): grounding information gain (blue, right axis, 0 to 6) and the R² value of the co-occurrence regression (orange, left axis, 0 to 0.8).
</details>
(d) 4-layer LSTM.
Figure 3: Grounding information gain and its correlation to the co-occurrence of linguistic and environment tokens over training steps.
### 4.1 Behavioral Evidence of Emergent Grounding
In this section, we ask: Does symbol grounding emerge behaviorally in autoregressive LMs? We first test whether models show a systematic surprisal reduction when predicting a linguistic token whose environmental counterpart is in context (Figure 2, where the gap between the lines represents the grounding information gain). For Transformers (Figures 2(a) and 2(b)) and Mamba-2 (Figure 2(c)), surprisal in the match condition decreases steadily while that in the mismatch condition enters a high-surprisal plateau early, indicating that the models leverage environmental context to predict the linguistic form. In contrast, the unidirectional LSTM (Figure 2(d)) shows little separation between the conditions, reflecting the absence of grounding. Overall, these results provide behavioral evidence of emergent grounding: in sufficiently expressive architectures (Transformers and Mamba-2), the correct environmental context reliably lowers surprisal for its linguistic counterpart, whereas LSTMs fail to exhibit this effect, marking an architectural boundary on where grounding can emerge.
### 4.2 Behavioral Effects Beyond Co-occurrence
A natural concern is that the surprisal reductions might be fully explainable by shallow statistics: the models might have simply memorized frequent co-occurrences of $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens, without learning a deeper and more general mapping. We test this hypothesis by comparing the tokens' co-occurrence with the grounding information gain in the child-directed speech data.
We define co-occurrence between the corresponding $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens at the granularity of a 512-token training chunk. For each target word $v$ , we count the number of chunks in which both its $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens appear. Following standard corpus-analysis practice, these raw counts are log-transformed. For each model checkpoint, we run linear regression between the log co-occurrence and the grounding information gain of words, obtaining an $R^{2}$ statistic as a function of training time.
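The chunk-level statistic above can be sketched as follows; the counting helper, the synthetic per-word data, and the use of squared Pearson correlation as the simple-regression $R^2$ are all illustrative assumptions:

```python
import numpy as np

def chunk_cooccurrence(chunks, env_tok, lan_tok):
    """Count 512-token training chunks containing both the <ENV> and <LAN> token."""
    return sum(1 for chunk in chunks if env_tok in chunk and lan_tok in chunk)

# Synthetic per-word statistics standing in for real checkpoint measurements.
rng = np.random.default_rng(0)
log_cooc = np.log1p(rng.integers(1, 100, size=50).astype(float))  # log-transformed counts
info_gain = 0.5 * log_cooc + rng.normal(0.0, 0.3, size=50)        # hypothetical gains

# For simple linear regression, R^2 equals the squared Pearson correlation.
r = np.corrcoef(log_cooc, info_gain)[0, 1]
r_squared = r ** 2
print(round(r_squared, 2))
```

Repeating the regression at each checkpoint yields the $R^2$-over-training curves reported next.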
Figure 3 shows the $R^{2}$ values (orange) alongside the grounding information gain (blue) for different architectures. In both the Transformer and Mamba-2, $R^{2}$ rises sharply in the early steps but then declines, even as the grounding information gain continues to increase. These results suggest that grounding in Transformers and Mamba-2 cannot be fully accounted for by co-occurrence statistics: while models initially exploit surface co-occurrence regularities, later improvements in grounding diverge from these statistics, indicating reliance on richer features acquired during training. In contrast, the LSTM shows persistently increasing $R^{2}$ but little growth in grounding information gain over training, suggesting that it encodes co-occurrence but lacks the architectural mechanism to transform it into predictive grounding.
### 4.3 Visual Dialogue with Captions and Images
[Figure: surprisal vs. training steps (0–20,000); the match condition (blue) declines steadily from ~11.5 to ~7.0, while the mismatch condition (orange) stays near 10–11.]
(a) Surprisal curves (w/ caption).
[Figure: surprisal vs. training steps (0–300,000); match (blue) falls to ~8.0 and plateaus, while mismatch (orange) stabilizes near 9.3.]
(b) Surprisal curves (w/ image).
[Figure: dual-axis plot over 20,000 training steps; information gain (blue, right axis) rises steadily to ~4.5, while $R^{2}$ (orange, left axis) peaks at ~0.75 near 5,000 steps and then declines.]
(c) $R^{2}$ and information gain (w/ caption).
[Figure: dual-axis plot over 300,000 training steps; information gain (blue, right axis) rises monotonically to ~2.5, while $R^{2}$ (orange, left axis) peaks near 50,000 steps and declines to ~0.25.]
(d) $R^{2}$ and information gain (w/ image).
Figure 4: Average surprisal of the experimental and control conditions in caption- and image-grounded dialogue settings, as well as the grounding information gain and its correlation to the co-occurrence of linguistic and environment tokens over training steps. All results are from a 12-layer Transformer model on grounded dialogue data.
We next test whether the grounding effects observed in CHILDES generalize to multimodal dialogue, using the Visual Dialog dataset. In this setting, the environmental ground is supplied either by captions or by image features (Table 1). For caption-grounded dialogue, the mismatch context is constructed in the same way as for CHILDES (Equation 2). For image-grounded dialogue, mismatch contexts are generated via Stable Diffusion 2 (Rombach et al., 2022)-based image inpainting, which re-generates the region defined by the ground-truth mask corresponding to the target word's referent.
We train 12-layer Transformers with 5 random seeds. As in Figures 2(a)-2(b) and 3(a)-3(b), when captions serve as the environmental ground, Transformers show a clear surprisal gap between the match and mismatch conditions (Figure 4(a)), with the grounding information gain increasing steadily while $R^{2}$ peaks early and declines (Figure 4(c)). Directly using images as the ground yields the same qualitative pattern (Figures 4(b) and 4(d)), although the observed effect is smaller. Both settings confirm that emergent grounding cannot be fully explained by co-occurrence statistics.
Overall, our findings demonstrate that Transformers are able to exploit environmental grounds in various modalities to facilitate linguistic prediction. The smaller but consistent gains in the image-grounded case suggest that while grounding from visual tokens is harder, the same architectural dynamics identified in textual testbeds still apply.
## 5 Mechanistic Explanation
In this section, we provide a mechanistic and interpretable account of the previous observation. We focus on a 12-layer Transformer trained on CHILDES with 5 random seeds, and defer broader generalization to the discussion.
[Figure: heatmap of ground-to-symbol attention saliency across layers 1–12 (x-axis) and training steps (y-axis); saliency concentrates in the middle layers, peaking around layer 8 late in training.]
(a) Saliency of layer-wise attention from environmental to linguistic tokens across training steps.
[Figure: tuned-lens surprisal across layers 1–12 at steps 5,000, 10,000, and 20,000; surprisal declines with depth, dropping most sharply from the middle layers at later checkpoints.]
(b) Layer-wise tuned lens to predict the $\langle$ LAN $\rangle$ token in match condition.
Figure 5: Mechanistic analyses of GPT-CHILDES over training.
### 5.1 The Emergence of Symbol Grounding
To provide a mechanistic account of symbol grounding, i.e., when it emerges during training and how it is represented in the network, we apply two interpretability analyses.
Saliency flow. For each layer $\ell$ , we compute a saliency matrix following Wang et al. (2023): $I_{\ell}=\left|\sum_{h}A_{h,\ell}\odot\frac{\partial\mathcal{L}}{\partial A_{h,\ell}}\right|$ , where $A_{h,\ell}$ denotes the attention matrix of head $h$ in layer $\ell$ . Each entry of $I_{\ell}$ quantifies the contribution of the corresponding attention weight to the cross-entropy loss $\mathcal{L}$ , summed across heads. Our analysis focuses on ground-to-symbol connections, i.e., flows from environmental ground ( $\langle$ ENV $\rangle$ ) tokens to the token immediately preceding (and predicting) their linguistic forms ( $\langle$ LAN $\rangle$ ).
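Assuming the attention maps and their loss gradients have already been extracted (e.g., via backward hooks), the per-layer saliency matrix can be sketched as:

```python
import numpy as np

def layer_saliency(attn, grad):
    """I_l = | sum_h A_{h,l} * dL/dA_{h,l} | for one layer.

    attn, grad: arrays of shape (heads, seq, seq) holding the attention
    weights and the gradient of the loss w.r.t. those weights.
    """
    return np.abs((attn * grad).sum(axis=0))

# Toy stand-ins: 2 heads, sequence length 4.
rng = np.random.default_rng(1)
A = rng.random((2, 4, 4))
G = rng.normal(size=(2, 4, 4))
I = layer_saliency(A, G)

# A ground-to-symbol entry reads off the flow from an <ENV> token position j
# into the position i immediately preceding its <LAN> token (indices hypothetical).
i, j = 3, 1
print(I.shape, float(I[i, j]) >= 0.0)
```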
Probing with the Tuned Lens. We probe layer-wise representations using the Tuned Lens (Belrose et al., 2023), which trains affine projectors to map intermediate activations to the final prediction space while keeping the LM output head frozen.
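A simplified numerical sketch of a tuned-lens-style probe: an affine translator is fit from mid-layer activations toward the final prediction space, and the frozen output head is reused. The least-squares fit, toy shapes, and random data are illustrative simplifications of the actual training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n = 8, 16, 200

W_out = rng.normal(size=(d, vocab))   # frozen LM output head
h_mid = rng.normal(size=(n, d))       # mid-layer activations
# Synthetic "final-layer" activations: a linear map of h_mid plus noise.
h_final = h_mid @ rng.normal(size=(d, d)) + 0.1 * rng.normal(size=(n, d))

# Fit the affine translator by ordinary least squares (bias via a column of 1s).
X = np.hstack([h_mid, np.ones((n, 1))])
theta, *_ = np.linalg.lstsq(X, h_final, rcond=None)

# Lens prediction: translate mid-layer states, then apply the frozen head.
logits = (X @ theta) @ W_out
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.shape)
```

Layer-wise surprisal curves (Figure 5(b)) then follow from evaluating each layer's lens on the $\langle$ LAN $\rangle$ targets.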
Results. Ground-to-symbol saliency is weak in the early stages of training but rises sharply later, peaking in layers 7â9 (Figure 5(a)), suggesting that mid-layer attention plays a central role in establishing symbolâground correspondences. In addition, Figure 5(b) shows that early layers remain poor predictors even at late training stages (e.g., after 20,000 steps), whereas surprisal begins to drop markedly from layer 7 at intermediate stages (step 10,000), suggesting a potential representational shift in the middle layers.
### 5.2 Hypothesis: Gather-and-Aggregate Heads Implement Symbol Grounding
Building on these results, we hypothesize that specific Transformer heads in the middle layers enable symbol grounding. To test this, we examine attention saliencies for selected heads (Figure 6). We find that several heads exhibit patterns consistent with the gather-and-aggregate mechanism described by Bick et al. (2025): gather heads (e.g., Figures 6(a) and 6(b)) compress relevant information into a subset of positions, while aggregate heads (e.g., Figures 6(c) and 6(d)) redistribute this information to downstream tokens. In our setup, saliency often concentrates on environmental tokens such as train$_{\langle\text{ENV}\rangle}$, where gather heads pool contextual information into compact, retrievable states. In turn, aggregate heads broadcast this information from the environmental ground (train$_{\langle\text{ENV}\rangle}$) to the token immediately preceding the linguistic form, thereby supporting the prediction of train$_{\langle\text{LAN}\rangle}$. Taking these observations together, we hypothesize that gather-and-aggregate heads implement the symbol grounding mechanism.
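The causal test of this hypothesis (Table 2) zeroes out the identified heads and compares surprisal against a control that zeroes an equal number of random heads. A minimal sketch of the zero-ablation step, with hypothetical head indices and toy tensors:

```python
import numpy as np

def ablate_heads(head_outputs, heads_to_zero):
    """Zero out selected attention heads' outputs (the causal intervention)."""
    out = head_outputs.copy()
    out[sorted(heads_to_zero)] = 0.0
    return out

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 5, 4))  # 8 heads, sequence length 5, head dim 4

# Intervention: zero the identified gather/aggregate heads (indices hypothetical).
targeted = ablate_heads(H, {2, 5})
# Control: zero an equal number of randomly selected heads.
control = ablate_heads(H, rng.choice(8, size=2, replace=False).tolist())
print(targeted.shape == H.shape)
```

Surprisal measured after the targeted ablation versus the random control then quantifies the causal contribution of the identified heads.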
[Figure: attention-saliency heatmap over the sequence "<CHI> saw a train passing by <CHI> i want to ride that" with <ENV> and <LAN> segments; saliency pools into the <ENV> "train" position.]
(a) Gather: L4 H7.
[Figure: attention-saliency heatmap over the same sequence; the <ENV> "train" cell dominates, with moderate saliency on <LAN> "passing", "by", and "ride".]
(b) Gather: L4 H8.
[Figure: attention-saliency heatmap over the same sequence; saliency is redistributed from the <ENV> ground to downstream <LAN>-segment tokens.]
(c) Aggregate: L7 H5.
[Figure: attention-saliency heatmap over the same sequence; saliency again flows from the <ENV> "train" position to downstream <LAN>-segment tokens.]
(d) Aggregate: L8 H5.
Figure 6: Examples of gather and aggregate heads identified in GPT-CHILDES. L: layer; H: head.
Table 2: Causal intervention results on identified gather and aggregate heads across training checkpoints (ckpt.). Avg. Count denotes the average number of heads of each type across inference runs, and Avg. Layer denotes the average layer index at which they appear. Interv. Sps. reports surprisal after zeroing out the identified heads, while Ctrl. Sps. reports surprisal after zeroing out an equal number of randomly selected heads. Original refers to the baseline surprisal without any intervention. *** indicates a significant result ( $p<0.001$ ) where the intervention surprisal is higher than that in the corresponding control experiment.
| Ckpt. | Gather Avg. Count | Gather Avg. Layer | Gather Interv. Sps. | Gather Ctrl. Sps. | Aggregate Avg. Count | Aggregate Avg. Layer | Aggregate Interv. Sps. | Aggregate Ctrl. Sps. | Original |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | 0.00 | - | - | - | 0.07 | 8.74 | 9.34 | 9.34 | 9.34 |
| 5000 | 0.35 | 3.32 | 6.37 | 6.38 | 2.28 | 7.38 | 6.51 (***) | 6.39 | 6.38 |
| 10000 | 3.26 | 3.67 | 5.25 | 5.32 | 5.09 | 7.28 | 5.86 (***) | 5.29 | 5.30 |
| 20000 | 5.76 | 3.59 | 4.69 | 4.79 | 6.71 | 7.52 | 5.62 (***) | 4.76 | 4.77 |
### 5.3 Causal Interventions of Attention Heads
We then conduct causal interventions on attention heads to validate the hypothesis above.
Operational definition. We identify attention heads as gather or aggregate heads according to the following criteria:
- Gather head: An attention head is classified as a gather head if at least 30% of its total saliency flows from the preceding tokens to the environmental ground token.
- Aggregate head: An attention head is classified as an aggregate head if at least 30% of its total saliency flows from the environmental ground token to the token immediately preceding the corresponding linguistic token.
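As a concrete illustration, the two criteria above can be sketched as a small classifier over a per-head saliency matrix. The matrix convention, index names, and helper function below are our own illustration (with hypothetical names), not the paper's code:

```python
import numpy as np

def classify_head(saliency, env_idx, ling_idx, threshold=0.3):
    """Classify one attention head from its token-to-token saliency matrix.

    saliency[q, k]: saliency flowing from source token k to query token q
    (lower-triangular for a causal model). env_idx: position of the
    environmental ground token; ling_idx: position of the linguistic token.
    """
    total = saliency.sum()
    if total == 0:
        return "neither"
    # Gather: >= threshold of saliency flows from preceding tokens INTO the ground token.
    gather_frac = saliency[env_idx, :env_idx].sum() / total
    # Aggregate: >= threshold of saliency flows FROM the ground token to the
    # token immediately preceding the linguistic token.
    aggregate_frac = saliency[ling_idx - 1, env_idx] / total
    if aggregate_frac >= threshold:
        return "aggregate"
    if gather_frac >= threshold:
        return "gather"
    return "neither"
```

In practice this check would be applied per head and per context, with the counts then averaged as in Table 2.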
Causal intervention methods. In each context, we apply causal interventions to the identified head types and their corresponding controls. Following Bick et al. (2025), interventions are implemented by zeroing out the outputs of heads. For the control, we mask an equal number of randomly selected heads in each layer, ensuring they do not overlap with the identified gather or aggregate heads.
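The zero-ablation and the matched random control described above can be sketched as follows; the shapes and helper names are illustrative assumptions (the actual intervention operates inside the model's forward pass):

```python
import numpy as np

def zero_heads(head_outputs, heads_to_ablate):
    """Zero-ablate attention heads: head_outputs has shape
    [n_heads, seq_len, d_head]; the listed heads' outputs are set to zero
    before they would be mixed by the output projection."""
    out = head_outputs.copy()
    out[sorted(heads_to_ablate)] = 0.0
    return out

def sample_control_heads(n_heads, identified, rng):
    """Sample as many random heads as were identified in this layer,
    guaranteed not to overlap with the identified set."""
    candidates = [h for h in range(n_heads) if h not in identified]
    chosen = rng.choice(candidates, size=len(identified), replace=False)
    return set(int(h) for h in chosen)
```

Surprisal is then compared across the intervention run, the control run, and the unmodified baseline.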
| Thres. | Ckpt. | Avg. Count | Avg. Layer | Interv. Sps. | Ctrl. Sps. | Original |
| --- | --- | --- | --- | --- | --- | --- |
| 70% | 20k | 32.30 | 7.78 | 9.96 | 9.95 | 9.21 |
| 70% | 100k | 35.63 | 7.71 | 9.42 (***) | 8.84 | 8.24 |
| 70% | 200k | 34.99 | 7.80 | 8.95 (***) | 8.15 | 7.76 |
| 70% | 300k | 34.15 | 7.76 | 8.96 (***) | 8.11 | 7.69 |
| 90% | 20k | 10.66 | 8.33 | 9.51 (***) | 9.43 | 9.21 |
| 90% | 100k | 13.90 | 8.26 | 8.95 (***) | 8.50 | 8.24 |
| 90% | 200k | 13.47 | 8.46 | 8.41 (***) | 7.88 | 7.76 |
| 90% | 300k | 12.73 | 8.42 | 8.40 (***) | 7.87 | 7.69 |
<details>
<summary>x22.png Details</summary>

[Heatmap of layer-wise saliency (color scale 0 to 0.008) over layers 1-12 (x-axis) and training steps 30k-300k (y-axis); saliency concentrates in layers 8-10 from the middle of training onward.]
</details>
Figure 7: Mechanistic analysis in the image-grounded visual dialogue setting. Left: Causal intervention results on identified aggregate heads across training checkpoints, where intervention on aggregate heads consistently yields significantly higher surprisal ( $p<0.001$ , ***) than the control group. Right: Saliency of layer-wise attention from environmental tokens (i.e., image tokens corresponding to patches within the bounding boxes of the target object) to linguistic tokens across training steps.
Results and discussions. As training progresses, the number of both gather and aggregate heads increases (Table 2), suggesting that these mechanisms emerge over the course of learning. Causal interventions reveal a clear dissociation: zeroing out aggregate heads consistently produces significantly higher surprisal than the controls, whereas interventions on gather heads have no such effect. This asymmetry suggests that gather heads play a less critical role in our setting, where the input template is semantically light and the environmental evidence alone suffices to shape the linguistic form. Layer-wise patterns further support this division of labor: gather heads cluster in shallow layers (3-4), while aggregate heads concentrate in middle layers (7-8). This resonates with our earlier probing results, where surprisal reductions became prominent only from layers 7-9. Together, these findings identify aggregate heads in the middle layers as the primary mechanism of grounding in the model.
### 5.4 Generalization to Visual Dialog with Images
We also conduct causal interventions on attention heads in the VLM to further validate our hypothesis.
Operational definition. We identify attention heads as aggregate heads following this criterion (we do not define gather heads in this setting): an attention head is classified as an aggregate head if at least a threshold fraction (70% or 90% in our experiments) of its total image-patch-to-end saliency flows from the patches inside the bounding box to the token immediately preceding the corresponding linguistic token.
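This criterion can be sketched as follows; the patch-grid geometry (a 24x24 grid over a 336-pixel image), the treatment of "image-patch-to-end saliency" as per-patch saliency into the pre-linguistic token, and all names are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

def box_to_patch_mask(box, grid=24, image_size=336):
    """Boolean mask over a flattened grid x grid patch grid marking patches
    whose centers fall inside an (x0, y0, x1, y1) pixel bounding box.
    Grid and image sizes are illustrative defaults."""
    patch = image_size / grid
    x0, y0, x1, y1 = box
    mask = np.zeros((grid, grid), dtype=bool)
    for r in range(grid):
        for c in range(grid):
            cx, cy = (c + 0.5) * patch, (r + 0.5) * patch
            mask[r, c] = x0 <= cx <= x1 and y0 <= cy <= y1
    return mask.ravel()

def is_aggregate_head(patch_saliency, in_box_mask, threshold=0.7):
    """patch_saliency[p]: saliency flowing from image patch p to the token
    immediately preceding the target linguistic token (for one head).
    The head qualifies if the in-box share of that saliency meets the threshold."""
    total = patch_saliency.sum()
    if total == 0:
        return False
    return patch_saliency[in_box_mask].sum() / total >= threshold
```

A head whose patch saliency lies entirely inside the box passes at either threshold; one that spreads saliency uniformly over the image does not.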
Causal intervention methods. In each context, we apply causal interventions to the identified heads and their corresponding controls in the language backbone of the model. As in Section 5.3, interventions are implemented by zeroing out a head's outputs. For the control, we mask an equal number of randomly selected heads in each layer, ensuring they do not overlap with the identified aggregate heads.
Results and discussions. As training progresses, the number of aggregate heads first increases and then plateaus (Figure 7), suggesting that this mechanism emerges over the course of learning. Causal interventions reveal that zeroing out aggregate heads consistently produces significantly higher surprisal than the controls. The average layer of these heads also aligns with the saliency heatmap, shown in Figure 7 (right).
## 6 Discussions
Generalization to full-scale VLMs. As an additional case study, we extend our grounding-as-aggregation hypothesis to a full-scale VLM, LLaVA-1.5-7B (Liu et al., 2023). Even in this heavily engineered architecture, we identify many attention heads exhibiting aggregation behavior consistent with our earlier findings (Figure 1(b)), reinforcing the view that symbol grounding arises from specialized heads. At the same time, full-scale VLMs present additional complications. Models like LLaVA use multiple sets of visual tokens, including CLIP-derived embeddings that already encode language priors, and global information may be stored in redundant artifact tokens rather than object-centric regions (Darcet et al., 2024). Moreover, the large number of visual tokens (environmental tokens, in our setup) substantially increases both computational cost and the difficulty of isolating genuine aggregation heads. These factors make systematic identification and intervention at scale a nontrivial challenge. For these reasons, while our case study highlights promising evidence of grounding heads in modern VLMs, systematic detection and causal evaluation of such heads at scale remains an open challenge. Future work will need to develop computationally viable methods for (i) automatically detecting aggregation heads across diverse VLMs, and (ii) applying causal interventions to validate their role in grounding. Addressing these challenges will be crucial for moving from anecdotal case studies to a more principled understanding of grounding in modern VLMs.
The philosophical roots of grounding, revisited. Our findings highlight the need to sharpen the meaning of grounding in multimodal models. Prior work has often equated grounding with statistical correlations between visual and textual signals, such as attention overlaps or geometric alignments (Bousselham et al., 2024; Cao et al., 2025; Schnaus et al., 2025). While informative, such correlations diverge from the classic formulation by Harnad (1990), which requires symbols to be causally anchored to their referents in the environment. On the other extreme, Gubelmann (2024) argued that the symbol grounding problem does not apply to LLMs as they "are connectionist, statistical devices that have no intrinsic symbolic structure." In contrast, we discover emergent symbolic structure as an intrinsic mechanistic property: one that can be traced along training, observed in the specialization of attention heads, and validated through causal interventions. This provides not only a practical diagnostic protocol that reveals when and how models genuinely tie symbols to meaning beyond surface-level correlations, but also challenges the view that grounding is philosophically irrelevant to systems without explicit symbolic structure.
Practical implications for LM hallucinations. Our findings have practical implications for improving the reliability of LM outputs: by identifying aggregation heads that mediate grounding between environmental and linguistic tokens, we provide a promising mechanism to assess model reliability before generation. Our findings also point to a pathway for mitigating hallucinations through attention control: many hallucination errors stem from misallocated attention in intermediate layers (Jiang et al., 2025; Chen et al., 2024b). Such attention-level signals can serve as early indicators of overtrust or false grounding, motivating practical solutions like decoding-time strategies to mitigate and eventually prevent hallucination (Huang et al., 2024).
## Acknowledgement
This work was supported in part by NSF IIS-1949634, NSF SES-2128623, NSERC RGPIN-2024-04395, the Weinberg Cognitive Science Fellowship to ZM, a Vector Scholarship to XL, and a Canada CIFAR AI Chair award to FS. The authors would like to thank Songlin Yang and Jing Ding for their valuable feedback.
## References
- Anthropic (2024) Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. URL https://www.anthropic.com/news/claude-3-family.
- Arora et al. (2025) Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts. Mechanistic evaluation of transformers and state space models. arXiv preprint arXiv:2505.15105, 2025.
- Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
- Bick et al. (2025) Aviv Bick, Eric P. Xing, and Albert Gu. Understanding the skill gap in recurrent models: The role of the gather-and-aggregate mechanism. In Forty-second International Conference on Machine Learning, 2025.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
- Bietti et al. (2023) Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 2023.
- Blevins et al. (2022) Terra Blevins, Hila Gonen, and Luke Zettlemoyer. Analyzing the mono-and cross-lingual pretraining dynamics of multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3575–3590, 2022.
- Bousselham et al. (2024) Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localization properties in vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3828–3837, 2024.
- Cao et al. (2025) Shengcao Cao, Liang-Yan Gui, and Yu-Xiong Wang. Emerging pixel grounding in large multimodal models without grounding supervision. In International Conference on Machine Learning, 2025.
- Chang & Bergen (2022) Tyler A Chang and Benjamin K Bergen. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16, 2022.
- Chang et al. (2024) Tyler A Chang, Zhuowen Tu, and Benjamin K Bergen. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability. Transactions of the Association for Computational Linguistics, 12:1346–1362, 2024.
- Chen et al. (2024a) Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv preprint arXiv:2406.16866, 2024a.
- Chen et al. (2023) Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- Chen et al. (2024b) Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision language models. Advances in Neural Information Processing Systems, 37:44393–44418, 2024b.
- Clark (1995) Eve V Clark. The lexicon in acquisition. Number 65. Cambridge University Press, 1995.
- Comanici et al. (2025) Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- Dao & Gu (2024) Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, pp. 10041–10071. PMLR, 2024.
- Darcet et al. (2024) Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024.
- Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 326–335, 2017.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
- Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
- Evanson et al. (2023) Linnea Evanson, Yair Lakretz, and Jean-Rémi King. Language acquisition: do children and language models follow similar learning stages? In Findings of the Association for Computational Linguistics: ACL 2023, pp. 12205–12218, 2023.
- Fazly et al. (2010) Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063, 2010.
- Fenson et al. (2006) Larry Fenson, Virginia A Marchman, Donna J Thal, Phillip S Dale, J Steven Reznick, and Elizabeth Bates. Macarthur-bates communicative development inventories. PsycTESTS Dataset, 2006.
- Gleitman & Landau (1994) Lila R Gleitman and Barbara Landau. The acquisition of the lexicon. MIT Press, 1994.
- Goodman et al. (2007) Noah Goodman, Joshua Tenenbaum, and Michael Black. A bayesian framework for cross-situational word-learning. Advances in neural information processing systems, 20, 2007.
- Gubelmann (2024) Reto Gubelmann. Pragmatic norms are all you need – why the symbol grounding problem does not apply to llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11663–11678, 2024.
- Hagendorff (2023) Thilo Hagendorff. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv preprint arXiv:2303.13988, 2023.
- Harnad (1990) Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Huang et al. (2024) Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13418–13427, 2024.
- Jiang et al. (2025) Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25004–25014, 2025.
- Kangaslahti et al. (2025) Sara Kangaslahti, Elan Rosenfeld, and Naomi Saphra. Hidden breakthroughs in language model training. arXiv preprint arXiv:2506.15872, 2025.
- Li et al. (2022) Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975, 2022.
- Lieberum et al. (2023) Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in chinchilla. arXiv preprint arXiv:2307.09458, 2023.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
- Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in neural information processing systems, volume 36, pp. 34892â34916, 2023.
- Lu et al. (2024) Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, and Iryna Gurevych. Are emergent abilities in large language models just in-context learning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5098–5139, 2024.
- Ma et al. (2023) Ziqiao Ma, Jiayi Pan, and Joyce Chai. World-to-words: Grounded open vocabulary acquisition through fast mapping in vision-language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 524–544, 2023.
- Ma et al. (2025) Ziqiao Ma, Zekun Wang, and Joyce Chai. Babysit a language model from scratch: Interactive language learning by trials and demonstrations. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 991–1010, 2025.
- MacWhinney (2000) Brian MacWhinney. The childes project: Tools for analyzing talk: Volume i: Transcription format and programs, volume ii: The database, 2000.
- Mao et al. (2019) Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, sentences from natural supervision. International Conference on Learning Representations (ICLR), 2019.
- Mao et al. (2021) Jiayuan Mao, Freda H. Shi, Jiajun Wu, Roger P. Levy, and Joshua B. Tenenbaum. Grammar-based grounded lexicon learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021.
- Meng et al. (2022) Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
- Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- OpenAI (2024) OpenAI. Hello gpt-4o, May 2024. URL https://openai.com/index/hello-gpt-4o/.
- Oquab et al. (2024) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, pp. 1–31, 2024.
- Peng et al. (2024) Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024.
- Pratt et al. (2020) Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, and Aniruddha Kembhavi. Grounded situation recognition. In European Conference on Computer Vision, pp. 314–332. Springer, 2020.
- Qu & Chai (2010) Shaolin Qu and Joyce Yue Chai. Context-based word acquisition for situated dialogue in a virtual world. Journal of Artificial Intelligence Research, 37:247–277, 2010.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
- Rasheed et al. (2024) Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Regier (2005) Terry Regier. The emergence of words: Attentional learning in form and meaning. Cognitive science, 29(6):819â865, 2005.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Roy & Pentland (2002) Deb K Roy and Alex P Pentland. Learning words from sights and sounds: A computational model. Cognitive science, 26(1):113–146, 2002.
- Sabet et al. (2020) Masoud Jalili Sabet, Philipp Dufter, François Yvon, and Hinrich Schütze. Simalign: High quality word alignments without parallel training data using static and contextualized embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020.
- Schaeffer et al. (2023) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2023.
- Schnaus et al. (2025) Dominik Schnaus, Nikita Araslanov, and Daniel Cremers. It's a (blind) match! Towards vision-language correspondence without parallel data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24983–24992, 2025.
- Sellam et al. (2021) Thibault Sellam, Steve Yadlowsky, Ian Tenney, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Raluca Turc, Jacob Eisenstein, et al. The multiberts: Bert reproductions for robustness analysis. In International Conference on Learning Representations, 2021.
- Shi et al. (2021) Haoyue Shi, Luke Zettlemoyer, and Sida I. Wang. Bilingual lexicon induction via unsupervised bitext construction and word alignment. In ACL, 2021.
- Siskind (1996) Jeffrey Mark Siskind. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):39–91, 1996.
- van der Wal et al. (2025) Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. Polypythias: Stability and outliers across fifty language model pre-training runs. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025), pp. 1–25, 2025.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wang et al. (2023) Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9840–9855, 2023.
- Wang et al. (2024) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. Cogvlm: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475–121499, 2024.
- Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- Wiegreffe et al. (2025) Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, and Ashish Sabharwal. Answer, assemble, ace: Understanding how LMs answer multiple choice questions. In The Thirteenth International Conference on Learning Representations, 2025.
- Wu et al. (2025a) Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, 2025a.
- Wu et al. (2025b) Zhaofeng Wu, Dani Yogatama, Jiasen Lu, and Yoon Kim. The semantic hub hypothesis: Language models share semantic representations across languages and modalities. In ICML, 2025b.
- Xia et al. (2023) Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Ves Stoyanov. Training trajectories of language models across scales. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13711–13738, 2023.
- Xia et al. (2024) Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Xu & Tenenbaum (2007) Fei Xu and Joshua B Tenenbaum. Word learning as bayesian inference. Psychological review, 114(2):245, 2007.
- You et al. (2024) Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In The Twelfth International Conference on Learning Representations, 2024.
- Yu (2005) Chen Yu. The emergence of links between lexical acquisition and object categorization: A computational study. Connection science, 17(3-4):381–397, 2005.
- Yu & Ballard (2007) Chen Yu and Dana H Ballard. A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70(13-15):2149–2165, 2007.
- Yu & Siskind (2013) Haonan Yu and Jeffrey Mark Siskind. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 53–63, 2013.
- Zhang et al. (2024a) Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems, 37:71737–71767, 2024a.
- Zhang et al. (2024b) Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, and Joyce Chai. Groundhog: Grounding large language models to holistic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024b.
- Zhao et al. (2024) Rosie Zhao, Naomi Saphra, and Sham M. Kakade. Distributional scaling laws for emergent capabilities. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024.
## Appendix A Dataset Details
### A.1 Context Templates
We select the target tokens via the following procedure:
1. Collect the words whose ENV and LAN frequencies in the CHILDES dataset are both at least 100;
2. Collect the list of nouns from CDI;
3. Take the intersection of the two lists and select the top 100 words (by ENV-token frequency) as the target token list.
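As an illustration, this procedure can be sketched in a few lines of Python (a minimal sketch; the frequency tables and CDI noun list below are toy placeholders, not the actual CHILDES/CDI data):

```python
def select_target_tokens(env_freq, lan_freq, cdi_nouns, min_freq=100, top_k=100):
    """Intersect frequent CHILDES words with CDI nouns, rank by ENV frequency."""
    # Step 1: words whose ENV and LAN frequencies are both >= min_freq.
    frequent = {w for w in env_freq
                if env_freq[w] >= min_freq and lan_freq.get(w, 0) >= min_freq}
    # Steps 2-3: intersect with CDI nouns, keep the top_k by ENV-token frequency.
    candidates = frequent & set(cdi_nouns)
    return sorted(candidates, key=lambda w: env_freq[w], reverse=True)[:top_k]

# Toy usage with made-up counts:
env = {"ball": 500, "box": 400, "the": 9000, "run": 150}
lan = {"ball": 300, "box": 250, "the": 8000, "run": 90}
nouns = ["ball", "box", "car"]
print(select_target_tokens(env, lan, nouns))  # ['ball', 'box']
```

Here "run" is filtered out by the LAN frequency threshold and "the" by the CDI noun intersection.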
In CHILDES, all contexts are created with gpt-4o-mini, followed by human verification to check whether the generated contexts are semantically light. We adopt the following prompt:
Prompt Templates for CHILDES
Given the word "{word}", create 3 pairs of sentences that follow this requirement: 1. The first sentence has a subject "The child", describing an event or situation, and has the word "{word}". Make sure to add a newline to the end of this first sentence 2. The second sentence is said by the child (only include the speech itself, don't include "the child say", etc.), and the word "{word}" also appears in the sentence said by the child. Do not add quote marks either 3. Print each sentence on one line. Do not include anything else. 4. Each sentence should be short, less than 10 words. 5. The word "{word}" in both sentence have the same meaning and have a clear indication or an implication relationship. 6. "{word}" should not appear at the first/second word of each sentence. Generate 3 pairs of such sentences, so there should be 6 lines in total. You should not add a number. For each line, just print out the sentence.
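For concreteness, the prompt can be instantiated per target word and sent to gpt-4o-mini via the standard OpenAI Python client. This is an illustrative sketch, not our released pipeline: `PROMPT_TEMPLATE` abbreviates the full template above, and the client call assumes an API key in the environment.

```python
# Illustrative sketch of context generation with gpt-4o-mini.
# PROMPT_TEMPLATE abbreviates the full prompt shown above.
PROMPT_TEMPLATE = (
    "Given the word '{word}', create 3 pairs of sentences that follow "
    "this requirement: ..."  # remaining instructions elided; see the template above
)

def build_prompt(word: str) -> str:
    return PROMPT_TEMPLATE.format(word=word)

def generate_contexts(word: str) -> list[str]:
    """Send the prompt to gpt-4o-mini and split the returned lines."""
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(word)}],
    )
    return resp.choices[0].message.content.strip().splitlines()
```

Each response is expected to contain six lines (three sentence pairs), which are then passed to human verification.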
In visual dialogue (caption version and VLM version), we pre-define 10 sets of templates for each version:
Prompt Templates for Visual Dialogue (Caption Version)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> is:<LAN> it:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> do:<LAN> you:<LAN> call:<LAN> this:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> can:<LAN> you:<LAN> name:<LAN> this:<LAN> object:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what's:<LAN> this:<LAN> called:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> this:<LAN> thing:<LAN> is:<LAN> <A> (predict [FILLER]:<LAN>)
Prompt Templates for Visual Dialogue (Caption Version) (continued)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> would:<LAN> you:<LAN> name:<LAN> this:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what's:<LAN> the:<LAN> name:<LAN> of:<LAN> this:<LAN> item:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> how:<LAN> do:<LAN> you:<LAN> identify:<LAN> this:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> do:<LAN> we:<LAN> have:<LAN> here:<LAN> <A> (predict [FILLER]:<LAN>) this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> how:<LAN> do:<LAN> you:<LAN> call:<LAN> this:<LAN> object:<LAN> <A> (predict [FILLER]:<LAN>)
Prompt Templates for Visual Dialogue (VLM Version)
"<image> \nwhat is it ?", "<image> \nwhat do you call this ?", "<image> \ncan you name this object ?", "<image> \nwhat is this called ?", "<image> \nwhat this thing is ?", "<image> \nwhat would you name this ?", "<image> \nwhat is the name of this item ?", "<image> \nhow do you identify this ?", "<image> \nwhat do we have here ?", "<image> \nhow do you call this object ?"
### A.2 Word Lists
CHILDES and Visual Dialog (Text Only). [box, book, ball, hand, paper, table, toy, head, car, chair, room, picture, doll, cup, towel, door, mouth, camera, duck, face, truck, bottle, puzzle, bird, tape, finger, bucket, block, stick, elephant, hat, bed, arm, dog, kitchen, spoon, hair, blanket, horse, tray, train, cow, foot, couch, necklace, cookie, plate, telephone, window, brush, ear, pig, purse, hammer, cat, shoulder, garage, button, monkey, pencil, shoe, drawer, leg, bear, milk, egg, bowl, juice, ladder, basket, coffee, bus, food, apple, bench, sheep, airplane, comb, bread, eye, animal, knee, shirt, cracker, glass, light, game, cheese, sofa, giraffe, turtle, stove, clock, star, refrigerator, banana, napkin, bunny, farm, money]
Visual Dialog (VLM). [box, book, table, toy, car, chair, doll, door, camera, duck, truck, bottle, bird, elephant, hat, bed, dog, spoon, horse, train, couch, necklace, cookie, plate, telephone, window, pig, cat, monkey, drawer, bear, milk, egg, bowl, juice, ladder, bus, food, apple, sheep, bread, animal, shirt, cheese, giraffe, clock, refrigerator, accordion, aircraft, alpaca, ambulance, ant, antelope, backpack, bagel, balloon, barrel, bathtub, beard, bee, beer, beetle, bicycle, bidet, billboard, boat, bookcase, boot, boy, broccoli, building, bull, burrito, bust, butterfly, cabbage, cabinetry, cake, camel, canary, candle, candy, cannon, canoe, carrot, cart, castle, caterpillar, cattle, cello, cheetah, chicken, chopsticks, closet, clothing, coat, cocktail, coffeemaker, coin, cosmetics]
## Appendix B Implementation Details
We outline the key implementation details in this section and provide links to the GitHub repositories:
- Model Training: https://github.com/Mars-tin/TraBank
- CHILDES Processing: https://github.com/Mars-tin/PyChildes
### B.1 Checkpointing
We save 33 checkpoints in total for text-only experiments and 16 checkpoints for the VLM setting.
CHILDES and Visual Dialog (Text Only). We save the intermediate steps: [0, 150, 300, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000]
Visual Dialog (VLM). We save the intermediate steps: [10000, 20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 180000, 200000, 220000, 240000, 260000, 280000, 300000]
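Both schedules are regular enough to generate programmatically; the following convenience sketch reproduces the exact lists above:

```python
# Text-only runs: dense saves early (0, 150, 300), every 500 steps up to 10k,
# then every 1,000 steps up to 20k -- 33 checkpoints in total.
text_only_steps = ([0, 150, 300]
                   + list(range(500, 10001, 500))
                   + list(range(11000, 20001, 1000)))

# VLM runs: 10k and 20k, then every 20k steps up to 300k -- 16 checkpoints.
vlm_steps = [10000, 20000] + list(range(40000, 300001, 20000))

assert len(text_only_steps) == 33 and len(vlm_steps) == 16
```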
### B.2 Training Details
For the text-only Transformer, Mamba2, and LSTM models, we randomly initialize them from scratch. The training process is conducted five times, each with a different random seed (using seeds 42, 142, 242, 342, and 442, respectively). The batch size is 16.
For VLM models, we randomly initialize the language model backbone from scratch and keep the DINOv2 vision encoder frozen. The training process is conducted five times for 300k steps, each with a different random seed (using seeds 42, 142, 242, 342, and 442, respectively).
All the models use a word-level tokenizer. A list of hyperparameters is shown below:
Transformer and LSTM Model.
- model_max_length: 512
- learning rate: 5e-5
- learning rate schedule: linear
- warmup_steps: 1000
- hidden_size: 768
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0
- batch_size: 16
- grad_clip_norm: 1.0
Mamba2 Model.
- model_max_length: 512
- learning rate: 4e-4
- learning rate schedule: linear
- warmup_steps: 2000
- hidden_size: 768
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.4
- batch_size: 16
- grad_clip_norm: 1.0
VLM Model.
- model_max_length: 1024
- learning rate: 2e-5
- learning rate schedule: cosine
- warmup_steps: 9000
- hidden_size: 768
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0
- batch_size: 16
- grad_clip_norm: 1.0
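The learning-rate behavior implied by these hyperparameters is a linear warmup followed by decay (linear for the text-only runs, cosine for the VLM run). A pure-Python sketch, assuming decay to zero at `total_steps` (the exact decay endpoint depends on the trainer configuration):

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps, schedule="linear"):
    """Learning rate at a given step: linear warmup, then linear or cosine decay."""
    if step < warmup_steps:                       # linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    if schedule == "linear":                      # linear decay to 0
        return base_lr * (1.0 - progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# e.g., Transformer config: lr 5e-5, 1000 warmup steps, 20k total steps
assert lr_at(500, 5e-5, 1000, 20000) == 2.5e-5    # halfway through warmup
assert lr_at(1000, 5e-5, 1000, 20000) == 5e-5     # peak at end of warmup
```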
### B.3 Computational Resources
Each Transformer, Mamba2, and LSTM model is trained on a single A40 GPU within 5 hours. For VLM models, training is conducted on 2 A40 GPUs over 15 hours, using a batch size of 8 per device.
## Appendix C Addendum to Results
(Figure: line graph of the proportion of total saliency attributed to gather vs. aggregate heads over training steps 2k–20k; both proportions rise over training, with aggregate consistently above gather, reaching roughly 0.62 vs. 0.38 at 20k steps.)
Figure 8: Gather-and-aggregate head saliency over time.
### C.1 Behavioral Analysis
We show the complete behavioral evidence for all models in Figure 9, and the co-occurrence analysis in Figure 10.
### C.2 Mechanistic Analysis
After identifying the set of gather and aggregate heads for each context, we conduct an over-time analysis of the proportion of saliency attributed to each head set relative to the total saliency, as illustrated in Figure 8.
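As a rough sketch, the per-checkpoint proportion plotted in Figure 8 can be computed by summing the saliency over each identified head set and normalizing by the total saliency (the `saliency[layer][head]` layout here is an assumption for illustration; head-set identification itself is described in the main text):

```python
def head_saliency_proportion(saliency, head_set):
    """Fraction of total saliency attributed to a set of (layer, head) pairs.

    saliency: nested list indexed as saliency[layer][head], holding
    nonnegative per-head saliency scores for one context (assumed layout).
    """
    total = sum(sum(row) for row in saliency)
    part = sum(saliency[l][h] for (l, h) in head_set)
    return part / total

# Toy example: 2 layers x 2 heads with uniform saliency.
sal = [[1.0, 1.0], [1.0, 1.0]]
print(head_saliency_proportion(sal, {(0, 0)}))          # 0.25
print(head_saliency_proportion(sal, {(1, 0), (1, 1)}))  # 0.5
```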
(Panel: surprisal vs. training steps for the Match and Mismatch conditions; Match drops sharply from ~12.5 to ~5.0, while Mismatch plateaus near ~7.5.)
(a) 4-layer Transformer.
(Panel: surprisal vs. training steps; Match declines steadily to ~5.0 by 20k steps, while Mismatch plateaus near ~7.5 after ~5k steps.)
(b) 12-layer Transformer.
(Panel: surprisal vs. training steps; Match plateaus near ~5.0, while Mismatch levels off near ~7.0.)
(c) 18-layer Transformer.
(Panel: surprisal vs. training steps; Match drops sharply to ~5.0 within 5k steps, while Mismatch rises gradually to ~9.0.)
(d) 12-layer Mamba 2.
(Panel: surprisal vs. training steps; Match falls from ~12.5 to ~4.5, while Mismatch stays near ~7.0–7.5 throughout.)
(e) 4-layer Mamba 2.
(Panel: surprisal vs. training steps; both conditions decline from ~12.5, with Match reaching ~7.5 and Mismatch plateauing near ~8.0.)
(f) 4-layer LSTM.
Figure 9: Average surprisal of the experimental and control conditions over training steps.
(Panel: information gain and R² value over training steps; R² peaks at ~0.35 around 5k steps and then declines, while information gain rises monotonically to ~2.5.)
(a) 4-layer Transformer.
(Panel: information gain and R² value over training steps; R² peaks at ~0.4 around 5k steps and declines to ~0.1, while information gain plateaus near ~2.5–3.0.)
(b) 12-layer Transformer.
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Graph: Information Gain vs R² Values Over Training Steps
### Overview
The image depicts a dual-axis line graph comparing two metrics, **Information gain** (blue line) and **R² value** (orange line), across **20,000 training steps**. The left y-axis represents R² values (0–0.8), while the right y-axis represents Information gain (0–6). The graph includes shaded confidence intervals for both lines.
---
### Components/Axes
- **X-axis**: Training steps (0 to 20,000, linear scale).
- **Left Y-axis**: R² values (0–0.8, linear scale).
- **Right Y-axis**: Information gain (0–6, linear scale).
- **Legend**: Located in the top-left corner, with:
- **Blue line**: Information gain.
- **Orange line**: R² value.
- **Shading**: Confidence intervals (light blue for Information gain, light orange for R²).
---
### Detailed Analysis
#### R² Values (Orange Line)
- **Initial Rise**: R² increases sharply from ~0.0 to ~0.4 between 0 and 5,000 training steps.
- **Peak**: Reaches a maximum of ~0.4 at ~5,000 steps.
- **Decline**: Gradually decreases to ~0.1 by 20,000 steps, with a shaded confidence interval narrowing over time.
#### Information Gain (Blue Line)
- **Steady Growth**: Increases monotonically from ~0.0 to ~2.0 across all training steps.
- **Plateau**: Flattens near ~2.0 after ~15,000 steps, with a widening confidence interval at later steps.
---
### Key Observations
1. **Divergence After Peak**: R² peaks early (~5,000 steps) and declines, while Information gain continues rising.
2. **Confidence Intervals**: R²'s uncertainty decreases after the peak, while Information gain's uncertainty increases post-15,000 steps.
3. **Scale Disparity**: Information gain values (~2) are roughly an order of magnitude larger than R² values (~0.1–0.4) at later steps.
---
### Interpretation
- **Trade-off Between Metrics**: The divergence suggests that Information gain and R² measure different aspects of model performance. R² (variance explained) plateaus early, while Information gain (potentially capturing feature relevance or predictive power) grows steadily.
- **Overfitting Hypothesis**: The decline in R² after 5,000 steps may indicate overfitting, as the model becomes overly complex relative to the data. Meanwhile, Information gain's continued growth implies the model retains or discovers new meaningful patterns.
- **Practical Implication**: Relying solely on R² could mislead optimization, as Information gain provides a more nuanced view of model utility in later training stages.
---
### Spatial Grounding
- **Legend**: Top-left corner, clearly associating colors with metrics.
- **Secondary Y-axis**: Right side, aligned with Information gain values.
- **Line Placement**: Blue (Information gain) consistently above orange (R²) after ~10,000 steps.
---
### Content Details
- **R² Peak**: ~0.4 at 5,000 steps (uncertainty ±0.05).
- **Information Gain Plateau**: ~2.0 at 20,000 steps (uncertainty ±0.2).
- **Cross-Reference**: Blue line (Information gain) matches legend; orange line (R²) matches legend.
---
### Key Observations (Reiterated)
- R² and Information gain trends are inversely related after 5,000 steps.
- Information gain's confidence interval widens significantly after 15,000 steps, suggesting increased variability in metric estimation.
</details>
(c) 18-layer Transformer.
<details>
<summary>x33.png Details</summary>

### Visual Description
## Line Graph: Model Performance Metrics Over Training Steps
### Overview
The image depicts a line graph comparing two metrics, **Information gain** and **R² values**, across 20,000 training steps. The graph includes two y-axes: the left axis (orange) represents R² values (0–0.8), and the right axis (blue) represents Information gain (0–6). A legend in the top-left corner distinguishes the two metrics.
---
### Components/Axes
- **X-axis**: Training steps (0 to 20,000, linear scale).
- **Left Y-axis**: R² values (0–0.8, linear scale).
- **Right Y-axis**: Information gain (0–6, linear scale).
- **Legend**:
- Blue line: Information gain.
- Orange line: R² value.
- **Placement**: Legend is top-left; axes are labeled with clear titles.
---
### Detailed Analysis
1. **R² Values (Orange Line)**:
- Starts at **0.0** at 0 steps.
- Peaks sharply at **~0.4** around 5,000 steps.
- Drops to **~0.05** by 10,000 steps and remains flat through 20,000 steps.
- Shaded area (uncertainty) narrows after the initial peak.
2. **Information Gain (Blue Line)**:
- Starts at **0.0** at 0 steps.
- Rises steadily to **~4.0** by 5,000 steps.
- Plateaus at **~4.5** by 20,000 steps.
- Shaded area (uncertainty) widens slightly after 10,000 steps.
---
### Key Observations
- **Inverse Relationship**: R² values peak early (5,000 steps) and decline, while Information gain increases monotonically.
- **Divergence**: After 5,000 steps, R² values drop sharply (~0.4 → 0.05), while Information gain continues to rise (~4.0 → 4.5).
- **Stability**: Both metrics stabilize after 10,000 steps, with minimal further change.
---
### Interpretation
- **R² Decline**: The sharp drop in R² after 5,000 steps suggests the model's predictive power diminishes as training progresses, potentially due to overfitting or diminishing returns.
- **Information Gain Rise**: The steady increase in Information gain indicates the model is learning to extract more meaningful patterns from the data over time, even as predictive accuracy (R²) declines.
- **Trade-off**: The divergence implies a potential trade-off between model complexity (higher Information gain) and generalization (lower R²). This could reflect a scenario where the model becomes more efficient at utilizing data but less accurate in predictions, possibly due to over-optimization for specific features.
The graph highlights a critical tension in model training: balancing immediate predictive performance (R²) with long-term data efficiency (Information gain).
</details>
(d) 12-layer Mamba 2.
<details>
<summary>x34.png Details</summary>

### Visual Description
## Line Graph: Training Metrics Over Steps
### Overview
The image depicts a line graph comparing two metrics, **Information gain** and **R² values**, across **20,000 training steps**. The graph includes two y-axes: the left axis (orange) for R² values (0–0.8) and the right axis (blue) for Information gain (0–6). A legend in the top-left corner distinguishes the two metrics by color.
---
### Components/Axes
- **X-axis**: Training steps (0 to 20,000, linear scale).
- **Left Y-axis (Orange)**: R² values (0 to 0.8).
- **Right Y-axis (Blue)**: Information gain (0 to 6).
- **Legend**:
- Blue line: Information gain.
- Orange line: R² value.
- **Shading**: Light blue and orange bands around the lines suggest confidence intervals or variability.
---
### Detailed Analysis
1. **R² Value (Orange Line)**:
- Starts at ~0.6 at step 0.
- Peaks sharply at ~0.75 around step 5,000.
- Declines steeply to ~0.05 by step 10,000.
- Remains near 0.05 until step 20,000.
2. **Information Gain (Blue Line)**:
- Starts at 0 at step 0.
- Increases monotonically, reaching ~4 by step 15,000.
- Plateaus at ~4 from step 15,000 to 20,000.
---
### Key Observations
- **Inverse Relationship Early On**: R² values drop sharply as Information gain rises (steps 0–10,000).
- **Divergence Post-10,000 Steps**: R² stabilizes near 0.05 while Information gain plateaus at ~4.
- **Peak Discrepancy**: R² peaks at step 5,000, while Information gain peaks at step 15,000.
---
### Interpretation
- **Early Training (Steps 0–5,000)**: High R² suggests the model initially fits the data well, but Information gain remains low, indicating limited feature relevance.
- **Mid-Training (Steps 5,000–15,000)**: R² plummets, likely due to overfitting or model complexity, while Information gain surges, implying the model begins capturing meaningful patterns.
- **Late Training (Steps 15,000–20,000)**: Both metrics stabilize. The low R² suggests poor generalization, but sustained Information gain implies the model retains useful feature relationships despite poor fit.
- **Anomaly**: The sharp R² drop after step 5,000 contrasts with the gradual Information gain rise, hinting at a potential trade-off between model fit and feature utility.
---
### Spatial Grounding
- **Legend**: Top-left corner, clearly associating colors with metrics.
- **Secondary Y-axis**: Right side, ensuring dual-scale clarity.
- **Line Placement**: Blue (Information gain) dominates the upper half; orange (R²) occupies the lower half.
---
### Content Details
- **R² Values**:
- Step 0: ~0.6
- Step 5,000: ~0.75 (peak)
- Step 10,000: ~0.05
- Step 20,000: ~0.05
- **Information Gain**:
- Step 0: 0
- Step 15,000: ~4 (peak)
- Step 20,000: ~4
---
### Key Observations (Reiterated)
- The graph highlights a critical phase shift in model behavior around step 5,000, where R² and Information gain diverge sharply.
- The late-stage plateau in Information gain suggests diminishing returns in feature relevance despite continued training.
---
### Interpretation (Expanded)
- **Model Behavior**: The divergence between R² and Information gain implies that early training prioritizes data fit (high R²) over meaningful feature extraction (low Information gain). Later, the model shifts focus to capturing feature relationships (high Information gain) at the cost of generalization (low R²).
- **Practical Implications**: This pattern may indicate a need for regularization or early stopping to balance fit and feature utility. The late-stage stability suggests the model has exhausted its capacity to learn new patterns.
</details>
(e) 4-layer Mamba 2.
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Graph: Model Performance Metrics Over Training Steps
### Overview
The image displays a dual-axis line graph tracking two performance metrics during model training: R² values (left y-axis) and Information gain (right y-axis) across 20,000 training steps. The graph includes a legend in the top-right corner and shaded uncertainty bands for the R² metric.
### Components/Axes
- **X-axis**: Training steps (0 to 20,000)
- **Left Y-axis**: R² values (0.0 to 0.8)
- **Right Y-axis**: Information gain (0 to 6)
- **Legend**:
- Blue line: Information gain
- Orange line: R² value
- **Shaded Area**: Uncertainty band around R² values
### Detailed Analysis
1. **R² Values (Orange Line)**:
- Starts at 0.0 at step 0
- Rapidly increases to ~0.6 by 10,000 steps
- Plateaus between 0.6 and 0.75 after 10,000 steps
- Shaded uncertainty band widens initially, then narrows as training progresses
2. **Information Gain (Blue Line)**:
- Starts at 0.0 at step 0
- Gradual linear increase to ~1.2 by 20,000 steps
- Slope remains relatively constant throughout training
### Key Observations
- R² values show diminishing returns after ~10,000 steps, while Information gain continues increasing linearly
- The orange line's shaded uncertainty band suggests measurement variability decreases with more training
- The Information gain axis spans roughly 7.5× the range of the R² axis (6 vs. 0.8)
### Interpretation
The data demonstrates two distinct learning phases:
1. **Early Training (0-10k steps)**:
- R² values show rapid improvement (0 → 0.6), indicating strong initial learning
- Information gain increases slowly (0 → 1.2), suggesting limited feature importance discovery
2. **Late Training (10k-20k steps)**:
- R² plateaus near 0.7, implying model saturation
- Information gain continues rising linearly, indicating ongoing discovery of subtle patterns
The divergence between metrics suggests potential overfitting risks: while predictive power (R²) stabilizes, the model continues accumulating information (possibly noise or irrelevant features). The uncertainty band around R² values highlights measurement reliability improvements with more training data.
</details>
(f) 4-layer LSTM.
Figure 10: Grounding information gain and its correlation to the co-occurrence of linguistic and environment tokens over training steps.
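As a rough illustration of the two quantities plotted in Figure 10, the sketch below shows one plausible way to compute them; the paper's exact estimators are not reproduced here, and the definition of information gain as the log-probability improvement of a linguistic token when the environmental ground is visible versus ablated is an assumption for illustration only.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination for a least-squares linear fit of ys on xs,
    e.g. correlating information gain with token co-occurrence counts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate case: no variance in one variable
    return sxy ** 2 / (sxx * syy)

def information_gain(p_with_ground, p_without_ground):
    """ASSUMED definition: gain (in nats) in the model's probability of the
    linguistic token when the environmental ground is present vs. ablated."""
    return math.log(p_with_ground) - math.log(p_without_ground)
```

For example, `r_squared([1, 2, 3], [2, 4, 6])` returns `1.0` (a perfect linear relationship), and `information_gain(0.8, 0.2)` returns `log(4) ≈ 1.386` nats, the kind of per-token quantity that, averaged over an evaluation set, would trace out the blue curves above.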