# The Mechanistic Emergence of Symbol Grounding in Language Models
**Authors**:
- Freda Shi, Joyce Chai (University of Michigan, University of Waterloo, Vector Institute, UNC at Chapel Hill)
Abstract
Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

*Authors contributed equally to this work. Advisors contributed equally to this work.*
1 Introduction
Symbol grounding (Harnad, 1990) refers to the problem of how abstract and discrete symbols, such as words, acquire meaning by connecting to perceptual or sensorimotor experiences. In the context of multimodal machine learning, grounding has been leveraged as an explicit pre-training objective for vision-language models (VLMs) by connecting linguistic units to the world that gives language meaning (Li et al., 2022; Ma et al., 2023). Through supervised fine-tuning with grounding signals, such as entity-phrase mappings, modern VLMs have achieved fine-grained understanding at both region (You et al., 2024; Peng et al., 2024; Wang et al., 2024) and pixel (Zhang et al., 2024b; Rasheed et al., 2024; Zhang et al., 2024a) levels.
With the rise of powerful autoregressive language models (LMs; OpenAI, 2024; Anthropic, 2024; Comanici et al., 2025, inter alia) and their VLM extensions, there is growing interest in identifying and interpreting their emergent capabilities. Recent work has shown preliminary correlational evidence that grounding may emerge in LMs (Sabet et al., 2020; Shi et al., 2021; Wu et al., 2025b) and VLMs (Cao et al., 2025; Bousselham et al., 2024; Schnaus et al., 2025) trained at scale, even when solely optimized with the simple next-token prediction objective. However, the potential underlying mechanisms that lead to such an emergence are not well understood. To address this limitation, our work seeks to understand the emergence of symbol grounding in LMs by causally and mechanistically tracing how it arises within the models' internal computations.
We begin by constructing a minimal testbed, motivated by the annotations provided in the CHILDES corpora (MacWhinney, 2000), where child–caregiver interactions provide cognitively plausible contexts for studying symbol grounding alongside verbal utterances. In our framework, each word is represented in two distinct forms: one token that appears in non-verbal scene descriptions (e.g., a box in the environment) and another that appears in spoken utterances (e.g., box in dialogue). We refer to these as environmental tokens ( $\langle$ ENV $\rangle$ ) and linguistic tokens ( $\langle$ LAN $\rangle$ ), respectively. A deliberately simple word-level tokenizer assigns separate vocabulary entries to each form, ensuring that they are treated as entirely different tokens by the language model. This framework enforces a structural separation between scenes and symbols, preventing correspondences from being reduced to trivial token identity. Under this setup, we can evaluate whether a model trained from scratch is able to predict the linguistic form from its environmental counterpart.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Grounding of Environmental and Linguistic Tokens
### Overview
The image depicts a diagram illustrating the grounding of environmental and linguistic tokens, specifically relating to the concept of a horse. It shows two horizontal rows of labeled boxes representing "Environmental Tokens" and "Linguistic Tokens" respectively, with an arrow indicating a grounding connection between the word "horse" in both rows.
### Components/Axes
The diagram consists of two main sections:
* **Environmental Tokens (<ENV>):** Located at the top, this row contains boxes labeled with individual words: `<CHI>painted<ENV>`, `a<ENV>`, `picture<ENV>`, `of<ENV>`, `a<ENV>`, and `horse<ENV>`.
* **Linguistic Tokens (<LAN>):** Located at the bottom, this row contains boxes labeled with individual words: `<CHI>my<LAN>`, `favorite<LAN>`, `animal<LAN>`, `is<LAN>`, `the<LAN>`, and `horse<LAN>`.
* **Grounding (Information Aggregation):** A label in the center of the diagram describes the arrow as "Grounding (Information Aggregation)".
* **Arrow:** A green arrow points from the "horse" token in the Environmental Tokens row to the "horse" token in the Linguistic Tokens row.
* **Dotted Box:** A dotted box surrounds the "horse" token in the Linguistic Tokens row.
### Detailed Analysis or Content Details
The diagram demonstrates a connection between the perception of a horse (environmental token) and its linguistic representation (linguistic token).
* **Environmental Tokens:** The sequence "painted a picture of a horse" suggests an observation or description of a visual scene.
* **Linguistic Tokens:** The sequence "my favorite animal is the horse" represents a statement about preference.
* **Grounding:** The arrow indicates that the linguistic token "horse" is grounded in the environmental token "horse," meaning the word refers to the actual object or concept.
### Key Observations
The diagram highlights the process of grounding, where language is connected to real-world perception. The use of `<CHI>` tags suggests these tokens are associated with a child's language acquisition. The dotted box around the "horse" token in the Linguistic Tokens row may indicate the focus of the grounding process.
### Interpretation
This diagram illustrates a fundamental concept in cognitive science and natural language processing: the grounding problem. It shows how language is not merely a symbolic system but is connected to our experiences and perceptions of the world. The diagram suggests that understanding language requires linking words to their referents in the environment. The use of child-directed language tags (`<CHI>`) implies this process is crucial for language development. The diagram is a simplified representation of a complex cognitive process, but it effectively conveys the core idea of grounding.
</details>
(a) Attention head 8 of layer 7 in GPT-CHILDES.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Environmental and Linguistic Token Grounding
### Overview
The image depicts a diagram illustrating the concept of grounding between "Environmental Tokens" (represented by an image of an alpaca) and "Linguistic Tokens" (represented by a sequence of text blocks). A yellow arrow visually connects the alpaca image to the text "what would you name this ? alpaca". The diagram highlights "Information Aggregation" as the process occurring during grounding.
### Components/Axes
The diagram consists of three main components:
1. **Environmental Tokens (<ENV>):** A photograph of an alpaca in an outdoor setting. The alpaca is light brown/white, standing in a dirt/sand area with a wooden fence and sparse trees in the background.
2. **Grounding (Information Aggregation):** A yellow arrow visually connecting the alpaca image to the Linguistic Tokens. The text "Grounding (Information Aggregation)" is positioned between the image and the text blocks.
3. **Linguistic Tokens (<LAN>):** A series of dark blue rectangular blocks containing the text: "what", "would", "you", "name", "this", "?", "alpaca". The blocks are arranged horizontally. A dashed box surrounds the last block ("alpaca").
### Detailed Analysis or Content Details
The diagram demonstrates a connection between a visual stimulus (the alpaca image) and a linguistic query ("what would you name this ? alpaca"). The yellow arrow indicates that the linguistic tokens are grounded in the environmental tokens. The text sequence suggests a question being posed about the alpaca.
The text blocks are arranged in a linear sequence, representing a sentence or phrase. The question mark indicates an interrogative sentence. The final block, "alpaca", is highlighted with a dashed border, potentially emphasizing the subject of the question.
### Key Observations
The diagram visually represents the process of grounding, where linguistic information is linked to perceptual information. The use of distinct labels (<ENV>, <LAN>) and the "Grounding" label clearly define the components and their relationship. The dashed box around "alpaca" suggests its importance in the grounding process.
### Interpretation
This diagram illustrates a core concept in multimodal AI and cognitive science: grounding. Grounding refers to the process by which symbols (words, phrases) acquire meaning through their connection to perceptual experiences (images, sounds, etc.). In this case, the linguistic tokens ("what would you name this ? alpaca") are grounded in the environmental token (the image of the alpaca). The "Information Aggregation" label suggests that the grounding process involves combining information from both modalities to create a coherent representation.
The diagram suggests a scenario where a system is attempting to understand or interact with the environment by associating language with visual objects. The question posed ("what would you name this ? alpaca") implies an intention to elicit a response that demonstrates understanding of the alpaca's identity. The diagram is a simplified representation of a complex cognitive process, but it effectively conveys the fundamental idea of grounding.
</details>
(b) Attention head 7 of layer 20 in LLaVA-1.5-7B.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Heatmap: Salience Map with Linguistic Input
### Overview
The image presents a heatmap visualizing salience, likely representing attention or importance, across a grid corresponding to a sentence. The grid is 12x12, indexed by 'layer' (1-12) on the vertical axis and 'head' (1-12) on the horizontal axis. A textual representation of a sentence is displayed alongside the heatmap, with lines connecting specific words to corresponding cells in the grid. A colorbar on the right indicates salience values ranging from 0.0 to 0.3.
### Components/Axes
* **X-axis (Head):** Numbered 1 to 12, indexing the attention heads within each layer.
* **Y-axis (Layer):** Numbered 1 to 12, indexing the Transformer layers.
* **Colorbar:** Represents salience values.
* 0.0: Dark Purple
* 0.3: Yellow-Green
* **Text:** The sentence is: `<CHI> painted a picture of a horse <CHI> my favorite animal is the`. The tags `<CHI>`, `<ENV>`, and `<LAN>` are present.
* **Lines:** Two lines connect words in the sentence to specific cells in the heatmap grid.
### Detailed Analysis
The heatmap is predominantly dark purple, indicating low salience across most cells. There are two areas of higher salience (yellow-green):
* **Connection 1:** A line originates from the word "horse" and connects to the cell at approximately (head=7, layer=6). The salience value at this cell is approximately 0.28.
* **Connection 2:** A line originates from the word "favorite" and connects to the cell at approximately (head=10, layer=11). The salience value at this cell is approximately 0.25.
The rest of the heatmap shows salience values generally below 0.1, with a consistent dark purple color. There is a slight increase in salience around the "horse" area, but it's localized.
### Key Observations
* The words "horse" and "favorite" appear to be the most salient elements in the sentence, as indicated by the heatmap.
* The salience values are relatively low overall, suggesting that the sentence as a whole doesn't elicit strong attention.
* The lines connecting words to the heatmap cells suggest a mapping between linguistic elements and a representation of their importance.
* The tags `<CHI>`, `<ENV>`, and `<LAN>` suggest the sentence is part of a larger linguistic dataset, potentially related to child language (`<CHI>`), environment (`<ENV>`), and language (`<LAN>`).
### Interpretation
This heatmap likely represents an attention mechanism or salience map derived from a neural network or computational model processing the sentence. The higher salience values for "horse" and "favorite" suggest that these words are considered more important or attention-grabbing within the context of the sentence. The use of tags like `<CHI>`, `<ENV>`, and `<LAN>` indicates that this data is likely part of a larger study on language acquisition or processing, potentially focusing on how children perceive and attend to different words in a sentence. The heatmap provides a visual representation of which words are most salient, offering insights into the model's understanding of the sentence's meaning and structure. The low overall salience values could indicate a relatively neutral or unremarkable sentence, or it could be a characteristic of the model's attention distribution. The lines connecting the words to the heatmap cells are crucial for understanding the mapping between linguistic input and the model's internal representation of salience.
</details>
(c) Left: saliency over tokens of each head in each layer for the prompt $\langle$ CHI $\rangle$ $\textit{painted}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{a}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{picture}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{of}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{a}_{\texttt{$\langle$ENV$\rangle$}}$ $\textit{horse}_{\texttt{$\langle$ENV$\rangle$}}$ $\langle$ CHI $\rangle$ $\textit{my}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{favorite}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{animal}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{is}_{\texttt{$\langle$LAN$\rangle$}}$ $\textit{the}_{\texttt{$\langle$LAN$\rangle$}}$. Right: among all heads, only one (head 8 of layer 7) is identified as an aggregate head, where information flows from $\textit{horse}_{\texttt{$\langle$ENV$\rangle$}}$ to the current position, encouraging the model to predict $\textit{horse}_{\texttt{$\langle$LAN$\rangle$}}$ as the next token.
Figure 1: Illustration of the symbol grounding mechanism through information aggregation. Lighter colors denote more salient attention, quantified by saliency scores, i.e., gradient $\times$ attention contributions to the loss (Wang et al., 2023). When predicting the next token, aggregate heads (Bick et al., 2025) emerge to exclusively link environmental tokens (visual or situational context; $\langle$ ENV $\rangle$ ) to linguistic tokens (words in text; $\langle$ LAN $\rangle$ ). These heads provide a mechanistic pathway for symbol grounding by mapping external environmental evidence into its linguistic form.
We quantify the level of grounding using surprisal: specifically, we compare how easily the model predicts a linguistic token ( $\langle$ LAN $\rangle$ ) when its matching environmental token ( $\langle$ ENV $\rangle$ ) is present versus when unrelated cues are given instead. A lower surprisal in the former condition indicates that the model has learned to align environmental grounds with linguistic forms. We find that LMs do learn to ground: the presence of environmental tokens consistently reduces surprisal for their linguistic counterparts, in a way that simple co-occurrence statistics cannot fully explain. To study the underlying mechanisms, we apply saliency analysis (Wang et al., 2023) and the tuned lens (Belrose et al., 2023), which converge on the result that grounding relations are concentrated in the middle layers of the network. Further analysis of attention heads reveals patterns consistent with the aggregate mechanism (Bick et al., 2025), where attention heads support the prediction of linguistic forms by retrieving their environmental grounds in the context.
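To make the saliency analysis concrete, the following sketch computes gradient $\times$ attention saliency scores per attention head for a single prompt. It assumes a Hugging Face causal LM with eager attention; the checkpoint name and example sentence are placeholders rather than the exact models or data used in this paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any GPT-2-style causal LM illustrates the computation.
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("painted a picture of a horse . my favorite animal is the horse",
                   return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"], output_attentions=True)

# Attention probabilities are non-leaf tensors: retain their gradients, then
# backpropagate the language-modeling loss through them.
for attn in outputs.attentions:
    attn.retain_grad()
outputs.loss.backward()

# Saliency of head h in layer l: |A * dL/dA|, summed over query/key positions,
# yielding a (num_layers, num_heads) map like the heatmap in Figure 1(c).
saliency = torch.stack(
    [(attn * attn.grad).abs().sum(dim=(-2, -1)).squeeze(0) for attn in outputs.attentions]
)
print(saliency.shape)  # e.g., torch.Size([12, 12]) for GPT-2 small
```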
Finally, we demonstrate that these findings generalize beyond the minimal CHILDES data and Transformer models. They appear in a multimodal setting with the Visual Dialog dataset (Das et al., 2017), and in state-space models (SSMs) such as Mamba-2 (Dao & Gu, 2024). In contrast, we do not observe grounding in unidirectional LSTMs, consistent with their sequential state compression and lack of content-addressable retrieval. Taken together, our results show that symbol grounding can mechanistically emerge in autoregressive LMs, while also delineating the architectural conditions under which it can arise.
2 Related Work
2.1 Language Grounding
Referential grounding has long been framed as the lexicon acquisition problem: how words map to referents in the world (Harnad, 1990; Gleitman & Landau, 1994; Clark, 1995). Early work focused on word-to-symbol mappings, designing learning mechanisms that simulate children's lexical acquisition and explain psycholinguistic phenomena (Siskind, 1996; Regier, 2005; Goodman et al., 2007; Fazly et al., 2010). Subsequent studies incorporated visual grounding, first by aligning words with object categories (Roy & Pentland, 2002; Yu, 2005; Xu & Tenenbaum, 2007; Yu & Ballard, 2007; Yu & Siskind, 2013), and later by mapping words to richer visual features (Qu & Chai, 2010; Mao et al., 2019; 2021; Pratt et al., 2020). More recently, large-scale VLMs trained with paired text–image supervision have advanced grounding to finer levels of granularity, achieving region-level (Li et al., 2022; Ma et al., 2023; Chen et al., 2023; You et al., 2024; Wang et al., 2024) and pixel-level (Xia et al., 2024; Rasheed et al., 2024; Zhang et al., 2024b) grounding, with strong performance on referring expression comprehension (Chen et al., 2024a).
Recent work suggests that grounding emerges as a property of VLMs trained without explicit supervision, with evidence drawn from attention-based spatial localization (Cao et al., 2025; Bousselham et al., 2024) and cross-modal geometric correspondences (Schnaus et al., 2025). However, all prior work focused exclusively on static final-stage models, overlooking the training trajectory, a crucial aspect for understanding when and how grounding emerges. In addition, existing work has framed grounding through correlations between visual and textual signals, diverging from the definition by Harnad (1990), which emphasizes causal links from symbols to meanings. To address these issues, we systematically examine learning dynamics throughout the training process, applying causal interventions to probe model internals and introducing control groups to enable rigorous comparison.
2.2 Emergent Capabilities and Learning Dynamics of LMs
A central debate concerns whether larger language models exhibit genuinely new behaviors: Wei et al. (2022) highlight abrupt improvements in tasks, whereas later studies argue such effects are artifacts of thresholds or in-context learning dynamics (Schaeffer et al., 2023; Lu et al., 2024). Beyond end performance, developmental analyses show that models acquire linguistic abilities in systematic though heterogeneous orders with variability across runs and checkpoints (Sellam et al., 2021; Blevins et al., 2022; Biderman et al., 2023; Xia et al., 2023; van der Wal et al., 2025). Psychology-inspired perspectives further emphasize controlled experimentation to assess these behaviors (Hagendorff, 2023), and comparative studies reveal both parallels and divergences between machine and human language learning (Chang & Bergen, 2022; Evanson et al., 2023; Chang et al., 2024; Ma et al., 2025). At a finer granularity, hidden-loss analyses identify phase-like transitions (Kangaslahti et al., 2025), while distributional studies attribute emergence to stochastic differences across training seeds (Zhao et al., 2024). Together, these studies suggest that emergent abilities are not sharp discontinuities but probabilistic outcomes of developmental learning dynamics. Following this line of work, we present a probability- and model-internals-based analysis of how symbol grounding emerges during language model training.
2.3 Mechanistic Interpretability of LMs
Mechanistic interpretability has largely focused on attention heads in Transformers (Elhage et al., 2021; Olsson et al., 2022; Meng et al., 2022; Bietti et al., 2023; Lieberum et al., 2023; Wu et al., 2025a). A central line of work established that induction heads emerge to support in-context learning (ICL; Elhage et al., 2021; Olsson et al., 2022), with follow-up studies tracing their training dynamics (Bietti et al., 2023) and mapping factual recall circuits (Meng et al., 2022). At larger scales, Lieberum et al. (2023) identified specialized content-gatherer and correct-letter heads, and Wu et al. (2025a) showed that a sparse set of retrieval heads is critical for reasoning and long-context performance. Relatedly, Wang et al. (2023) demonstrated that label words in demonstrations act as anchors: early layers gather semantic information into these tokens, which later guide prediction. Based on these insights, Bick et al. (2025) proposed that retrieval is implemented through a coordinated gather-and-aggregate (G&A) mechanism: some heads collect content from relevant tokens, while others aggregate it at the prediction position. Other studies extended this line of work by analyzing failure modes and training dynamics (Wiegreffe et al., 2025) and contrasting retrieval mechanisms in Transformers and SSMs (Arora et al., 2025). Whereas prior analyses typically investigate ICL with repeated syntactic or symbolic formats, our setup requires referential alignment between linguistic forms and their environmental contexts, providing a complementary testbed for naturalistic language grounding.
3 Method
Table 1: Training and test examples across datasets with target word book. The training examples combine environmental tokens ( $\langle$ ENV $\rangle$ ; shaded) with linguistic tokens ( $\langle$ LAN $\rangle$ ). Test examples are constructed with either matched (book) or mismatched (toy) environmental contexts, paired with corresponding linguistic prompts. Note that in child-directed speech and caption-grounded dialogue, book ${}_{\texttt{$\langle$ENV$\rangle$}}$ and book ${}_{\texttt{$\langle$LAN$\rangle$}}$ are two distinct tokens received by LMs.
| Dataset | Training $\langle$ ENV $\rangle$ | Training $\langle$ LAN $\rangle$ | Matched test $\langle$ ENV $\rangle$ | Mismatched test $\langle$ ENV $\rangle$ | Test $\langle$ LAN $\rangle$ prompt |
| --- | --- | --- | --- | --- | --- |
| Child-Directed Speech | $\langle$ CHI $\rangle$ takes book from mother | $\langle$ CHI $\rangle$ what's that $\langle$ MOT $\rangle$ a book in it … | $\langle$ CHI $\rangle$ asked for a new book | $\langle$ CHI $\rangle$ asked for a new toy | $\langle$ CHI $\rangle$ I love this |
| Caption-Grounded Dialogue | a dog appears to be reading a book with a full bookshelf behind | $\langle$ Q $\rangle$ can you tell what book it's reading $\langle$ A $\rangle$ the marriage of true minds by stephen evans | this is a book | this is a toy | $\langle$ Q $\rangle$ can you name this object $\langle$ A $\rangle$ |
| Image-Grounded Dialogue | (training image: dog reading a book; see figs/data/book-train.jpg details below) | $\langle$ Q $\rangle$ can you tell what book it's reading $\langle$ A $\rangle$ the marriage of true minds by stephen evans | (test image: bookshelf; see figs/data/book-test.jpg details below) | (control image: display cabinets; see figs/data/book-test-control.jpg details below) | what do we have here? |
<details>
<summary>figs/data/book-train.jpg Details</summary>

### Visual Description
## Photograph: Dog with Book
### Overview
The image is a photograph of a black and white dog lying on a wooden floor next to a yellow book. The dog is positioned in front of a bookshelf filled with books. The scene appears to be indoors, likely a home library or living room. The image does not contain charts, diagrams, or data tables. It is a static scene with no quantifiable data.
### Components/Axes
There are no axes or components in the traditional sense of a chart or diagram. The key elements are:
* **Dog:** A medium-sized dog with black and white fur.
* **Book:** A yellow-covered book titled "The Marriage of True Minds" by Stephen Evans.
* **Bookshelf:** A wooden bookshelf filled with numerous books.
* **Floor:** A wooden floor.
### Detailed Analysis or Content Details
The book cover is predominantly yellow. The title "THE MARRIAGE OF TRUE MINDS" is written in black, bold, capital letters. Below the title is a small illustration of a red pot with a plant growing out of it. The author's name, "Stephen Evans," is printed in black at the bottom of the cover.
A quote is visible on the book cover: "A funny, poignant, brilliantly observed book." - Kirkus Reviews.
The bookshelf contains books with visible titles including:
* "In Defense of Animals"
* "Wild Animals"
* "The Sea Around Us"
* "The Strouds"
* "Animal Rights: The Issues The Movement"
* "Bears Almost Human"
The dog is looking slightly to the right of the frame with a relaxed expression. The dog's paws are extended forward.
### Key Observations
The image is a staged scene, likely intended to be humorous or whimsical. The presence of books related to animals suggests a possible theme of animal intelligence or sentience. The book title, "The Marriage of True Minds," is a reference to Shakespeare's Sonnet 116, which explores the nature of true love and connection.
### Interpretation
The image appears to be a playful commentary on intelligence, literature, and the relationship between humans and animals. The dog's proximity to the book, and its seemingly thoughtful gaze, could be interpreted as a suggestion that animals are capable of understanding complex ideas or emotions. The book's title, referencing a deep connection, might imply a desire for understanding or empathy between species. The bookshelf filled with books on animals further reinforces this theme. The image is not presenting data, but rather a visual narrative that invites interpretation and reflection. It's a scene designed to evoke thought and perhaps a smile.
</details>
<details>
<summary>figs/data/book-test.jpg Details</summary>

### Visual Description
## Photograph: Bookshelf Interior
### Overview
The image depicts a large, built-in wooden bookshelf filled with books, photographs, and decorative objects. The bookshelf spans a significant portion of a wall and is divided into multiple sections with both open shelving and closed cabinet space. The overall impression is one of a well-established, personal library. There is no chart, diagram, or data to extract. This is a descriptive analysis.
### Components/Axes
There are no axes or legends present in the image. The primary components are:
* **Bookshelf Structure:** Constructed from dark wood, with a combination of open shelves and cabinet doors.
* **Books:** A large collection of books of varying sizes and colors.
* **Photographs:** Numerous framed photographs displayed on the shelves.
* **Decorative Objects:** Including a model ship, clocks, figurines, and other small items.
* **Wall Color:** A pale yellow/cream color.
* **Chair:** A dark-colored chair is partially visible on the right side of the image.
### Detailed Analysis or Content Details
The bookshelf is divided into four main sections.
* **Left Section:** Contains a mix of books and smaller decorative items. The books appear to be a variety of sizes and colors.
* **Center-Left Section:** Primarily filled with books, arranged densely on the shelves.
* **Center-Right Section:** Contains a mix of books and framed photographs. The photographs appear to be portraits of individuals.
* **Right Section:** Features a model ship on the top shelf and a collection of books, photographs, and decorative objects on the lower shelves.
The books are arranged in a generally horizontal fashion, with some vertical stacking. The photographs are primarily framed and displayed upright. The decorative objects are scattered throughout the shelves, adding visual interest.
The cabinet doors are a lighter wood tone than the surrounding bookshelf structure. The hardware on the cabinet doors appears to be brass or gold-colored.
### Key Observations
* The bookshelf appears to be well-maintained and organized, though not rigidly so.
* The collection of books suggests a diverse range of interests.
* The presence of numerous photographs indicates a personal and sentimental value to the space.
* The model ship is a prominent decorative element, suggesting an interest in maritime history or sailing.
* The overall aesthetic is traditional and classic.
### Interpretation
The image portrays a space dedicated to learning, memory, and personal expression. The bookshelf is not merely a storage unit for books, but a curated collection that reflects the owner's interests, experiences, and relationships. The combination of books, photographs, and decorative objects creates a warm and inviting atmosphere. The arrangement of items suggests a deliberate effort to create a visually appealing and meaningful display. The lack of any clear organizational system (e.g., alphabetical order, genre) suggests that the books are valued for their individual significance rather than their categorization. The presence of the model ship could indicate a passion for nautical themes or a connection to a specific historical period. The overall impression is one of a comfortable and intellectually stimulating environment. The image does not provide any quantifiable data or trends, but rather offers a glimpse into the personal world of the bookshelf's owner.
</details>
<details>
<summary>figs/data/book-test-control.jpg Details</summary>

### Visual Description
## Photograph: Display Cabinets with Collectibles
### Overview
The image depicts a series of dark wood display cabinets with glass fronts, showcasing various collectibles. A model ship is positioned on top of the rightmost cabinet. The background is a plain yellow wall. The image does not contain charts, graphs, or data tables. It is a static visual representation of objects.
### Components/Axes
There are no axes or legends present in the image. The primary components are:
* **Display Cabinets:** Four large, dark wood cabinets arranged in a row. Each cabinet has a glass-fronted display area and lower storage sections.
* **Collectibles:** Various items are visible within the cabinets, appearing to be miniature figures, models, and other small objects. The specific items are difficult to discern due to the glass and reflections.
* **Model Ship:** A detailed model of a sailing ship is placed on top of the rightmost cabinet.
* **Background:** A plain yellow wall serves as the backdrop.
* **Tripod:** A black tripod is partially visible on the right side of the image.
### Detailed Analysis or Content Details
The cabinets are constructed from a dark, possibly stained wood with a prominent grain pattern. The glass fronts appear to be slightly reflective, obscuring a clear view of the contents.
* **Cabinet 1 (Leftmost):** Contains a collection of small objects, possibly miniature figures or tools.
* **Cabinet 2:** Appears to contain a collection of green and white objects, possibly plants or figurines.
* **Cabinet 3:** Contains a collection of objects, some of which appear to be reddish in color.
* **Cabinet 4 (Rightmost):** Contains a collection of objects, partially obscured by the tripod.
The model ship on top of the rightmost cabinet is a detailed replica of a sailing vessel, with multiple masts, sails, and rigging. It is approximately 1/4 to 1/3 the height of the cabinets.
### Key Observations
The arrangement of the cabinets suggests a deliberate display of collectibles. The use of dark wood and glass creates a formal and elegant presentation. The model ship adds a nautical theme to the display. The presence of the tripod suggests the photograph was taken in a setting where photography is common.
### Interpretation
The image likely depicts a collector's display in a home or museum setting. The items within the cabinets are likely of personal significance to the owner, representing a hobby or interest. The careful arrangement and presentation suggest a desire to showcase these items in an aesthetically pleasing manner. The photograph itself doesn't offer any specific data or trends, but it provides a glimpse into the world of collecting and the personal expression that it can represent. The choice of a nautical theme with the model ship could indicate an interest in maritime history or sailing. The overall impression is one of curated nostalgia and personal passion.
</details>
3.1 Dataset and Tokenization
To capture the emergent grounding from multimodal interactions, we design a minimal testbed with a custom word-level tokenizer, in which every lexical item is represented in two corresponding forms: one token that appears in non-verbal descriptions (e.g., a book in the scene description) and another that appears in utterances (e.g., book in speech). We refer to these as environmental ( $\langle$ ENV $\rangle$ ) and linguistic ( $\langle$ LAN $\rangle$ ) tokens, respectively. For instance, book ${}_{\texttt{$\langle$ENV$\rangle$}}$ and book ${}_{\texttt{$\langle$LAN$\rangle$}}$ are treated as distinct tokens with separate integer indices; that is, the tokenization provides no explicit signal that these tokens are related, so any correspondence between them must be learned during training rather than inherited from their surface form. We instantiate this framework in three datasets, ranging from child-directed speech transcripts to image-based dialogue.
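The sketch below illustrates the dual-form tokenization described above: each surface word receives two unrelated integer ids, one for the environmental stream and one for the linguistic stream. All class and variable names are illustrative; the paper's actual implementation may differ.

```python
# Minimal sketch of a word-level tokenizer with separate <ENV> and <LAN> entries.
class DualFormTokenizer:
    def __init__(self, vocabulary, special_tokens=("<CHI>", "<MOT>", "<Q>", "<A>")):
        self.tok2id = {}
        for tok in special_tokens:
            self.tok2id[tok] = len(self.tok2id)
        for word in vocabulary:
            # Two distinct, unrelated ids per word: nothing in the vocabulary
            # signals that book<ENV> and book<LAN> are forms of the same word.
            self.tok2id[f"{word}<ENV>"] = len(self.tok2id)
            self.tok2id[f"{word}<LAN>"] = len(self.tok2id)

    def encode(self, words, stream):
        # stream is "ENV" for scene descriptions and "LAN" for utterances.
        return [self.tok2id[w] if w in self.tok2id else self.tok2id[f"{w}<{stream}>"]
                for w in words]

tokenizer = DualFormTokenizer(
    ["book", "toy", "asked", "for", "a", "new", "i", "love", "this"]
)
env_ids = tokenizer.encode(["<CHI>", "asked", "for", "a", "new", "book"], stream="ENV")
lan_ids = tokenizer.encode(["<CHI>", "i", "love", "this", "book"], stream="LAN")
assert tokenizer.tok2id["book<ENV>"] != tokenizer.tok2id["book<LAN>"]
```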
Child-directed speech. The Child Language Data Exchange System (CHILDES; MacWhinney, 2000) provides transcripts of speech enriched with environmental annotations (see the CHAT manual for data usage: https://talkbank.org/0info/manuals/CHAT.pdf). We use the spoken utterances as the linguistic tokens ( $\langle$ LAN $\rangle$ ) and the environmental descriptions as the environmental tokens ( $\langle$ ENV $\rangle$ ). The environmental context is drawn from three annotation types:
- Local events: simple events, pauses, long events, or remarks interleaved with the transcripts.
- Action tiers: actions performed by the speaker or listener (e.g., %act: runs to toy box). These also include cases where an action replaces speech (e.g., 0 [% kicks the ball]).
- Situational tiers: situational information tied to utterances or to larger contexts (e.g., %sit: dog is barking).
Caption-grounded dialogue. The Visual Dialog dataset (Das et al., 2017) pairs MSCOCO images (Lin et al., 2014) with multi-turn question-answering dialogues that exchange information about each image. Our setup uses MSCOCO captions as the environmental tokens ( $\langle$ ENV $\rangle$ ) and the dialogue turns as the linguistic tokens ( $\langle$ LAN $\rangle$ ). In this pseudo cross-modal setting, textual descriptions of visual scenes ground natural conversational interaction. Compared to CHILDES, this setup introduces richer semantics and longer utterances, while still using text-based inputs for both token types, thereby offering a stepping stone toward grounding in fully visual contexts.
Image-grounded dialogue. To move beyond textual proxies, we consider an image-grounded dialogue setup, using the same dataset as the caption-grounded dialogue setting. Here, a frozen vision transformer (ViT; Dosovitskiy et al., 2020) directly tokenizes each RGB image into patch embeddings, with each embedding treated as an $\langle$ ENV $\rangle$ token, analogously to the visual tokens in modern VLMs. We use DINOv2 (Oquab et al., 2024) as our ViT tokenizer, as it is trained purely on vision data without auxiliary text supervision (in contrast to models like CLIP; Radford et al., 2021), thereby ensuring that environmental tokens capture only visual information. The linguistic tokens ( $\langle$ LAN $\rangle$ ) remain unchanged from the caption-grounded dialogue setting, resulting in a realistic multimodal interaction where conversational utterances are grounded directly in visual input.
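A sketch of the image-to-$\langle$ENV$\rangle$-token step is shown below, using DINOv2 loaded from torch.hub. The hub entry point and feature keys follow the public DINOv2 repository; the preprocessing and model size are illustrative choices rather than the paper's exact configuration.

```python
import torch
from PIL import Image
from torchvision import transforms

# Frozen DINOv2 ViT as the visual tokenizer (trained without text supervision).
vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # 224 / 14 = 16 patches per side
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feats = vit.forward_features(image)

# One <ENV> token per image patch; these embeddings are fed directly into the LM
# (no projection layer), followed by the <LAN> tokens of the dialogue.
env_tokens = feats["x_norm_patchtokens"]   # shape: (1, 256, embed_dim)
```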
3.2 Evaluation Protocol
We assess symbol grounding with a contrastive test that asks whether a model assigns a higher probability to the correct linguistic token when the matching environmental token is in context, following the idea of priming in psychology. This evaluation applies uniformly across datasets (Table 1): in CHILDES and caption-grounded dialogue, environmental priming comes from descriptive contexts; in image-grounded dialogue, from ViT-derived visual tokens. We compare the following conditions:
- Match (experimental condition): The context contains the corresponding $\langle$ ENV $\rangle$ token for the target word, and the model is expected to predict its $\langle$ LAN $\rangle$ counterpart.
- Mismatch (control condition): The context is replaced with a different $\langle$ ENV $\rangle$ token. The model remains tasked with predicting the same $\langle$ LAN $\rangle$ token; however, in the absence of corresponding environmental cues, its performance is expected to be no better than chance.
For example (first row in Table 1), when evaluating the word $\textit{book}_{\texttt{$\langle$LAN$\rangle$}}$ , the input context is
$$
\langle\textit{CHI}\rangle\ \textit{asked}_{\langle\text{ENV}\rangle}\ \textit{for}_{\langle\text{ENV}\rangle}\ \textit{a}_{\langle\text{ENV}\rangle}\ \textit{new}_{\langle\text{ENV}\rangle}\ \textit{book}_{\langle\text{ENV}\rangle}\ \langle\textit{CHI}\rangle\ \textit{I}_{\langle\text{LAN}\rangle}\ \textit{love}_{\langle\text{LAN}\rangle}\ \textit{this}_{\langle\text{LAN}\rangle}\ \underline{\hskip 30pt}, \tag{1}
$$
where the model is expected to predict $\textit{book}_{\texttt{$\langle$LAN$\rangle$}}$ for the blank, and the role token $\langle$ CHI $\rangle$ indicates that the speaker or actor is a child. In the control (mismatch) condition, the environmental token book ${}_{\texttt{$\langle$ENV$\rangle$}}$ is replaced by another valid noun such as toy ${}_{\texttt{$\langle$ENV$\rangle$}}$ .
Context templates. For a target word $v$ with linguistic token $v_{\texttt{$\langle$LAN$\rangle$}}$ and environmental token $v_{\texttt{$\langle$ENV$\rangle$}}$ , we denote by $\overline{C}_{v}$ a set of context templates for $v$ . For example, when $v=\textit{book}$ , a template $\overline{c}\in\overline{C}_{v}$ can be
$$
\langle\textit{CHI}\rangle\ \textit{asked}_{\langle\text{ENV}\rangle}\ \textit{for}_{\langle\text{ENV}\rangle}\ \textit{a}_{\langle\text{ENV}\rangle}\ \textit{new}_{\langle\text{ENV}\rangle}\ \texttt{[FILLER]}\ \langle\textit{CHI}\rangle\ \textit{I}_{\langle\text{LAN}\rangle}\ \textit{love}_{\langle\text{LAN}\rangle}\ \textit{this}_{\langle\text{LAN}\rangle}\ \underline{\hskip 30pt}, \tag{2}
$$
where [FILLER] is to be replaced with an environmental token, and the blank indicates the expected prediction as in Eq. (1). In the match condition, the context $\overline{c}(v)$ is constructed by replacing [FILLER] with $v_{\texttt{$\langle$ENV$\rangle$}}$ in $\overline{c}$ . In the mismatch condition, the context $\overline{c}(u)$ uses $u_{\texttt{$\langle$ENV$\rangle$}}$ ( $u\neq v$ ) as the filler, while the prediction target remains $v_{\texttt{$\langle$LAN$\rangle$}}$ .
For the choices of $v$ and $u$ , we construct the vocabulary $V$ with 100 nouns from the MacArthur–Bates Communicative Development Inventories (Fenson et al., 2006) that occur frequently in our corpus. Each word serves once as the target, with the remaining $M=99$ used to construct mismatched conditions. For each word, we create $N=10$ context templates, which contain both $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens. Details of the vocabulary and context template construction can be found in Appendix A.
Grounding information gain. Following prior work, we evaluate how well an LM learns a word using the mean surprisal over instances. The surprisal of a word $w$ given a context $c$ is defined as $s_{\boldsymbol{\theta}}(w\mid c)=-\log P_{\boldsymbol{\theta}}(w\mid c),$ where $P_{\boldsymbol{\theta}}(w\mid c)$ denotes the probability, under an LM parameterized by ${\boldsymbol{\theta}}$ , that the next word is $w$ conditioned on the context $c$ . Here, $s_{\boldsymbol{\theta}}(w\mid c)$ quantifies the unexpectedness of predicting $w$ , or the pointwise information carried by $w$ conditioned on the context.
The grounding information gain $G_{\boldsymbol{\theta}}(v)$ for $v$ is defined as
$$
G_{\boldsymbol{\theta}}(v)=\frac{1}{N}\sum_{n=1}^{N}\frac{1}{M}\sum_{u\neq v}\Big[s_{\boldsymbol{\theta}}\big(v_{\langle\text{LAN}\rangle}\mid\overline{c}_{n}(u_{\langle\text{ENV}\rangle})\big)-s_{\boldsymbol{\theta}}\big(v_{\langle\text{LAN}\rangle}\mid\overline{c}_{n}(v_{\langle\text{ENV}\rangle})\big)\Big].
$$
This is a sample-based estimation of the expected log-likelihood ratio between the match and mismatch conditions
$$
G_{\boldsymbol{\theta}}(v)=\mathbb{E}_{c,u}\left[\log\frac{P_{\boldsymbol{\theta}}(v_{\langle\text{LAN}\rangle}\mid c,\,v_{\langle\text{ENV}\rangle})}{P_{\boldsymbol{\theta}}(v_{\langle\text{LAN}\rangle}\mid c,\,u_{\langle\text{ENV}\rangle})}\right],
$$
which quantifies how much more information the matched ground provides for predicting the linguistic form, compared to a mismatched one. A positive $G_{\boldsymbol{\theta}}(v)$ indicates that the matched environmental token increases the predictability of its linguistic form. We report $G_{\boldsymbol{\theta}}=\frac{1}{|V|}\sum_{v\in V}G_{\boldsymbol{\theta}}(v)$ , and track $G_{{\boldsymbol{\theta}}^{(t)}}$ across training steps $t$ to analyze how grounding emerges over time.
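A sample-based computation of $G_{\boldsymbol{\theta}}(v)$ can be sketched as below, assuming a Hugging Face-style causal LM whose forward pass returns next-token logits; `surprisal`, `grounding_information_gain`, and the callable templates are hypothetical helpers introduced for illustration.

```python
import torch
import torch.nn.functional as F

def surprisal(model, context_ids, target_id):
    """s(w | c) = -log P(w | c) in nats, from the model's next-token distribution."""
    with torch.no_grad():
        logits = model(torch.tensor([context_ids])).logits[0, -1]
    return -F.log_softmax(logits, dim=-1)[target_id].item()

def grounding_information_gain(model, templates, v_env, v_lan, distractor_envs):
    """Estimate G_theta(v): mean surprisal gap between mismatch and match conditions."""
    gains = []
    for fill in templates:   # each template maps an <ENV> token id to full context ids
        s_match = surprisal(model, fill(v_env), v_lan)
        s_mismatch = sum(surprisal(model, fill(u), v_lan)
                         for u in distractor_envs) / len(distractor_envs)
        gains.append(s_mismatch - s_match)
    return sum(gains) / len(gains)   # averaged over the N templates
```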
3.3 Model Training
We train LMs from random initialization, ensuring that no prior linguistic knowledge influences the results. Our training uses the standard causal language modeling objective, as in most generative LMs. To account for variability, we repeat all experiments with 5 random seeds, randomizing both model initialization and corpus shuffle order. Our primary architecture is a Transformer (Vaswani et al., 2017) in the style of GPT-2 (Radford et al., 2019) with 18, 12, or 4 layers, all with residual connections. We extend the experiments to 4-layer unidirectional LSTMs (Hochreiter & Schmidhuber, 1997) without residual connections, as well as 12- and 4-layer state-space models (specifically, Mamba-2; Dao & Gu, 2024). For a fair comparison with LSTMs, the 4-layer Mamba-2 models have no residual connections, whereas the 12-layer ones do. For multimodal settings, whereas standard LLaVA (Liu et al., 2023) uses a two-layer perceptron to project ViT embeddings into the language model, we bypass this projection and feed the DINOv2 representations directly into the LM. We obtain the developmental trajectory of each model by saving checkpoints at various training steps, sampling more densely at earlier steps, following Chang & Bergen (2022).
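For concreteness, instantiating one of the from-scratch Transformers might look like the sketch below with Hugging Face's GPT-2 classes; all hyperparameter values are illustrative placeholders rather than the paper's exact configuration.

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=5000,    # size of the dual-form word-level vocabulary (illustrative)
    n_positions=512,    # maximum context length
    n_embd=512,
    n_layer=12,         # 18- and 4-layer variants are trained analogously
    n_head=8,
)
model = GPT2LMHeadModel(config)   # randomly initialized; no pretrained weights loaded
# Training uses the standard causal language-modeling loss, saving checkpoints at
# intermediate steps (more densely early on) to trace the developmental trajectory.
```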
4 Behavioral Evidence
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch" conditions. The chart appears to track the evolution of surprisal during a training process.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 4.5 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner of the chart.
* "Match" - represented by a dark blue line.
* "Mismatch" - represented by a light orange line.
### Detailed Analysis
The "Match" line (dark blue) starts at approximately 5.2 and exhibits a generally decreasing trend, leveling off around a surprisal value of 4.8 at 20000 training steps. The initial slope is steep, but it gradually becomes flatter as training progresses.
The "Mismatch" line (light orange) begins at approximately 11.0 and also shows a decreasing trend, but it plateaus at a higher surprisal value than the "Match" line, around 6.8 at 20000 training steps. The initial decrease is rapid, but the line fluctuates more than the "Match" line, indicating greater variability.
Here's a breakdown of approximate data points:
**Match (Dark Blue):**
* 0 Training Steps: ~5.2 Surprisal
* 2000 Training Steps: ~4.9 Surprisal
* 4000 Training Steps: ~4.7 Surprisal
* 6000 Training Steps: ~4.6 Surprisal
* 8000 Training Steps: ~4.5 Surprisal
* 10000 Training Steps: ~4.4 Surprisal
* 12000 Training Steps: ~4.3 Surprisal
* 14000 Training Steps: ~4.2 Surprisal
* 16000 Training Steps: ~4.1 Surprisal
* 18000 Training Steps: ~4.0 Surprisal
* 20000 Training Steps: ~4.8 Surprisal
**Mismatch (Light Orange):**
* 0 Training Steps: ~11.0 Surprisal
* 2000 Training Steps: ~8.0 Surprisal
* 4000 Training Steps: ~7.2 Surprisal
* 6000 Training Steps: ~6.8 Surprisal
* 8000 Training Steps: ~6.6 Surprisal
* 10000 Training Steps: ~6.5 Surprisal
* 12000 Training Steps: ~6.5 Surprisal
* 14000 Training Steps: ~6.6 Surprisal
* 16000 Training Steps: ~6.7 Surprisal
* 18000 Training Steps: ~6.7 Surprisal
* 20000 Training Steps: ~6.8 Surprisal
### Key Observations
* The "Mismatch" condition consistently exhibits higher surprisal values than the "Match" condition throughout the training process.
* Both conditions demonstrate a decreasing trend in surprisal, suggesting that the model is learning and becoming more confident in its predictions.
* The "Mismatch" line shows more fluctuation, indicating that the model struggles more with mismatched data.
* The "Match" line appears to converge to a lower surprisal value, suggesting better performance on matched data.
### Interpretation
The chart likely represents the training dynamics of a model designed to identify matches or mismatches between data points. "Surprisal" can be interpreted as a measure of how unexpected or uncertain the model is about its predictions. A higher surprisal value indicates greater uncertainty.
The decreasing trend in both lines suggests that the model is learning to better distinguish between "Match" and "Mismatch" conditions as training progresses. The consistently higher surprisal for "Mismatch" indicates that the model finds it more difficult to process or predict mismatched data, which is expected. The convergence of the "Match" line to a lower surprisal value suggests that the model is becoming highly confident in its ability to identify matched data.
The fluctuations in the "Mismatch" line could indicate that the model is encountering diverse or challenging mismatched examples during training. This could be due to noise in the data, complex relationships between features, or limitations in the model's capacity. Further investigation into the nature of the mismatched data could provide insights into how to improve the model's performance.
</details>
(a) 12-layer Transformer.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch". The chart appears to track the surprisal of a model during training, potentially indicating how well the model is learning to predict or represent the data.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 5.0 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner of the chart.
* "Match" - represented by a dark blue line.
* "Mismatch" - represented by a light orange line.
### Detailed Analysis
* **Match (Dark Blue Line):** The line starts at approximately 7.3 at 0 training steps and exhibits a generally downward trend, indicating decreasing surprisal as training progresses.
* At approximately 5000 training steps, the surprisal is around 6.5.
* At approximately 10000 training steps, the surprisal is around 5.8.
* At approximately 15000 training steps, the surprisal is around 5.4.
* At approximately 20000 training steps, the surprisal is around 5.1.
* **Mismatch (Light Orange Line):** The line begins at approximately 7.5 at 0 training steps and also shows a decreasing trend, but it plateaus at a higher surprisal level than the "Match" line.
* At approximately 5000 training steps, the surprisal is around 6.8.
* At approximately 10000 training steps, the surprisal is around 6.5.
* At approximately 15000 training steps, the surprisal is around 6.4.
* At approximately 20000 training steps, the surprisal is around 6.3.
### Key Observations
* Both "Match" and "Mismatch" lines demonstrate a decreasing surprisal with increasing training steps, suggesting that the model is learning over time.
* The "Match" line consistently exhibits lower surprisal values than the "Mismatch" line throughout the entire training process. This indicates that the model is better at predicting or representing the "Match" data compared to the "Mismatch" data.
* The rate of decrease in surprisal appears to slow down for both lines as training progresses, suggesting diminishing returns from further training.
* The "Mismatch" line appears to converge towards a stable surprisal value around 6.3, while the "Match" line continues to decrease, albeit at a slower rate.
### Interpretation
The chart suggests that the model is learning to better represent the "Match" data than the "Mismatch" data. The decreasing surprisal for both lines indicates that the model is improving its predictive capabilities with more training. The difference in surprisal between the two lines could be due to several factors, such as:
* The "Match" data being inherently easier to model.
* The "Mismatch" data containing more noise or complexity.
* The model being specifically designed to perform well on the "Match" data.
The plateauing of the "Mismatch" line suggests that the model may have reached its limit in representing this type of data, or that further training would require a different approach. The continued decrease in surprisal for the "Match" line indicates that further training could still yield improvements in performance. This data could be used to evaluate the effectiveness of a training process, or to identify areas where the model could be improved.
</details>
(b) 4-layer Transformer.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch". The chart appears to track the surprisal of a model during training, potentially indicating how well the model is learning to predict or represent the data.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 4.5 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner.
* "Match" - represented by a dark blue line.
* "Mismatch" - represented by an orange line.
### Detailed Analysis
**Match (Dark Blue Line):**
The "Match" line begins at approximately 6.0 at 0 training steps. It exhibits a steep downward trend initially, decreasing to a minimum of approximately 4.2 at around 2000 training steps. After this point, the line plateaus and fluctuates between approximately 4.2 and 5.0 until 20000 training steps, ending at approximately 4.6.
**Mismatch (Orange Line):**
The "Mismatch" line starts at approximately 7.5 at 0 training steps. It shows a slight initial decrease to around 7.2 at 2000 training steps. From 2000 to 20000 training steps, the line remains relatively stable, fluctuating between approximately 7.2 and 7.8, ending at approximately 7.6.
### Key Observations
* The "Match" line consistently exhibits lower surprisal values than the "Mismatch" line throughout the entire training process.
* The "Match" line demonstrates a significant decrease in surprisal during the initial 2000 training steps, suggesting rapid learning or adaptation.
* Both lines appear to converge towards a stable state after approximately 2000 training steps, indicating that the rate of change in surprisal diminishes over time.
* The "Mismatch" line shows minimal change in surprisal, suggesting that the model struggles to learn or represent the mismatched data.
### Interpretation
The chart suggests that the model is learning to better represent or predict the "Match" data as training progresses, as evidenced by the decreasing surprisal. The relatively constant surprisal for the "Mismatch" data indicates that the model is not effectively learning from this data, potentially due to inherent differences or complexities in the mismatched examples. The convergence of both lines towards stable values after 2000 training steps suggests that the model's learning capacity or the effectiveness of the training process may be reaching a limit. The difference in surprisal between the two conditions could be used as a metric to evaluate the model's ability to distinguish between matched and mismatched data. The data suggests that the model is more successful at learning the "Match" data than the "Mismatch" data.
</details>
(c) 4-layer Mamba 2.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch" conditions. The chart appears to track the change in surprisal during a training process.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 5.0 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner of the chart.
* "Match" - represented by a blue line.
* "Mismatch" - represented by an orange line.
### Detailed Analysis
* **Match (Blue Line):** The line starts at approximately 7.2 at 0 training steps. It initially decreases rapidly to a minimum of approximately 6.8 at around 5000 training steps. After this point, the line plateaus and fluctuates around a value of approximately 7.0 until 20000 training steps.
* **Mismatch (Orange Line):** The line begins at approximately 12.3 at 0 training steps. It decreases sharply initially, reaching a value of approximately 8.0 at around 2000 training steps. The rate of decrease slows down, and the line continues to descend, reaching approximately 7.4 at 20000 training steps.
### Key Observations
* Both "Match" and "Mismatch" lines exhibit a decreasing trend in surprisal as training steps increase, indicating a learning or adaptation process.
* The "Mismatch" condition consistently has a higher surprisal value than the "Match" condition throughout the entire training process.
* The rate of decrease in surprisal is more pronounced in the initial stages of training for both conditions.
* The "Match" line appears to converge towards a stable value around 7.0, while the "Mismatch" line continues to decrease, albeit at a slower rate, until 20000 training steps.
### Interpretation
The chart suggests that the training process reduces the surprisal associated with both "Match" and "Mismatch" conditions. The higher initial and sustained surprisal in the "Mismatch" condition indicates that the model finds it more difficult to predict or accommodate mismatched data. The convergence of the "Match" line suggests that the model learns to effectively handle matching data, while the continued decrease in the "Mismatch" line implies that the model is still adapting to handle mismatched data, but is not fully converging. This could indicate that the mismatch condition represents a more complex or challenging learning scenario. The data suggests a potential difference in the model's ability to generalize to mismatched data compared to matched data. The chart provides insight into the learning dynamics of the model and the impact of data matching on its performance.
</details>
(d) 4-layer LSTM.
Figure 2: Average surprisal of the experimental and control conditions over training steps.
(a) 12-layer Transformer.
(b) 4-layer Transformer.
(c) 4-layer Mamba 2.
(d) 4-layer LSTM.
Figure 3: Grounding information gain and its correlation to the co-occurrence of linguistic and environment tokens over training steps.
4.1 Behavioral Evidence of Emergent Grounding
In this section, we ask: Does symbol grounding emerge behaviorally in autoregressive LMs? We first test whether models show a systematic reduction in surprisal when predicting a linguistic token whose environmental counterpart is in context (Figure 2, where the gap between the lines represents the grounding information gain). For Transformers (Figures 2(a) and 2(b)) and Mamba-2 (Figure 2(c)), surprisal in the match condition decreases steadily while surprisal in the mismatch condition enters a high plateau early, indicating that these models leverage environmental context to predict the linguistic form. In contrast, the unidirectional LSTM (Figure 2(d)) shows little separation between the conditions, reflecting the absence of grounding. Overall, these results provide behavioral evidence of emergent grounding: in sufficiently expressive architectures (Transformers and Mamba-2), the correct environmental context reliably lowers surprisal for its linguistic counterpart, whereas LSTMs fail to exhibit this effect, marking an architectural boundary on where grounding can emerge.
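For concreteness, the grounding information gain of a target word can be read off as the difference between the two surprisal curves. The following is a minimal sketch of this measurement, assuming a HuggingFace-style causal LM; `match_ids`, `mismatch_ids`, and `lan_id` are illustrative names for the two contexts (ending immediately before the target position) and the $\langle$ LAN $\rangle$ token's vocabulary index, and base-2 surprisal is an assumption.

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def surprisal(model, context_ids: torch.Tensor, lan_id: int) -> float:
    """Surprisal (in bits; base 2 is an assumption) of the <LAN> token given a context."""
    logits = model(context_ids.unsqueeze(0)).logits[0, -1]   # next-token logits
    log_prob = F.log_softmax(logits, dim=-1)[lan_id]
    return -log_prob.item() / math.log(2)


def grounding_information_gain(model, match_ids, mismatch_ids, lan_id) -> float:
    """Gap between the mismatch and match surprisal curves for one target word."""
    return surprisal(model, mismatch_ids, lan_id) - surprisal(model, match_ids, lan_id)
```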
4.2 Behavioral Effects Beyond Co-occurrence
A natural concern is that the surprisal reductions might be fully explainable by shallow statistics: the models might have simply memorized frequent co-occurrences of $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens, without learning a deeper and more general mapping. We test this hypothesis by comparing the tokens' co-occurrence with the grounding information gain in the child-directed speech data.
We define co-occurrence between the corresponding $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens at the granularity of a 512-token training chunk. For each target word $v$ , we count the number of chunks in which both its $\langle$ ENV $\rangle$ and $\langle$ LAN $\rangle$ tokens appear. Following standard corpus-analysis practice, these raw counts are log-transformed. For each model checkpoint, we run linear regression between the log co-occurrence and the grounding information gain of words, obtaining an $R^{2}$ statistic as a function of training time.
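As an illustration of this procedure, the sketch below counts chunk-level co-occurrences, log-transforms them, and regresses the per-word grounding information gain on the result to obtain $R^{2}$ for a single checkpoint. The names `chunks`, `pairs`, and `gain` are illustrative, and the +1 inside the logarithm is an assumption to handle zero counts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def cooccurrence_r2(chunks, pairs, gain):
    """R^2 of regressing grounding information gain on log chunk-level co-occurrence."""
    words = sorted(pairs)
    counts = np.zeros(len(words))
    for chunk in chunks:                      # each chunk: one 512-token training segment
        present = set(chunk)
        for i, w in enumerate(words):
            env_id, lan_id = pairs[w]
            counts[i] += (env_id in present) and (lan_id in present)
    x = np.log1p(counts).reshape(-1, 1)       # log-transformed counts (the +1 is an assumption)
    y = np.array([gain[w] for w in words])
    return LinearRegression().fit(x, y).score(x, y)
```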
Figure 3 shows the $R^{2}$ values (orange) alongside the grounding information gain (blue) for different architectures. In both the Transformer and Mamba-2, $R^{2}$ rises sharply in the early steps but then declines, even as the grounding information gain continues to increase. These results suggest that grounding in Transformers and Mamba-2 cannot be fully accounted for by co-occurrence statistics: while the models initially exploit surface co-occurrence regularities, later improvements in grounding diverge from these statistics, indicating reliance on richer features acquired during training. In contrast, the LSTM shows a persistently increasing $R^{2}$ but little increase in grounding information gain over training, suggesting that it encodes co-occurrence but lacks the architectural mechanism to transform it into predictive grounding.
4.3 Visual Dialogue with Captions and Images
(a) Surprisal curves (w/ caption).
(b) Surprisal curves (w/ image).
(c) $R^{2}$ and information gain (w/ caption).
(d) $R^{2}$ and information gain (w/ image).
Figure 4: Average surprisal of the experimental and control conditions in caption- and image-grounded dialogue settings, as well as the grounding information gain and its correlation to the co-occurrence of linguistic and environment tokens over training steps. All results are from a 12-layer Transformer model on grounded dialogue data.
We next test whether the grounding effects observed in CHILDES generalize to multimodal dialogue, using the Visual Dialog dataset. In this setting, the environmental ground is supplied either by captions or by image features (Table 1). For caption-grounded dialogue, the mismatch context is constructed in the same way as for CHILDES (Equation 2). For image-grounded dialogue, mismatch contexts are generated via Stable Diffusion 2 (Rombach et al., 2022)-based image inpainting, which re-generates the region defined by the ground-truth mask corresponding to the target word's referent.
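A minimal sketch of this inpainting step, using the public `stabilityai/stable-diffusion-2-inpainting` checkpoint through the `diffusers` library; the distractor prompt and mask preprocessing here are assumptions for illustration rather than the exact pipeline used in our experiments.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

# Load the public SD2 inpainting checkpoint (fp16 on GPU is an assumption).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")


def make_mismatch_image(image, referent_mask, distractor_prompt):
    """Re-generate only the masked referent region, yielding a mismatched scene."""
    return pipe(prompt=distractor_prompt, image=image, mask_image=referent_mask).images[0]
```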
We train 12-layer Transformers with 5 random seeds. As in Figures 2(a)-2(b) and 3(a)-3(b), when captions serve as the environmental ground, Transformers show a clear surprisal gap between the match and mismatch conditions (Figure 4(a)), with the grounding information gain increasing steadily while $R^{2}$ peaks early and then declines (Figure 4(c)). Directly using images as the ground yields the same qualitative pattern (Figures 4(b) and 4(d)), although the observed effect is smaller. Both settings confirm that emergent grounding cannot be fully explained by co-occurrence statistics.
Overall, our findings demonstrate that Transformers are able to exploit environmental grounds in various modalities to facilitate linguistic prediction. The smaller but consistent gains in the image-grounded case suggest that while grounding from visual tokens is harder, the same architectural dynamics identified in textual testbeds still apply.
5 Mechanistic Explanation
In this section, we provide a mechanistic and interpretable account of the previous observation. We focus on a 12-layer Transformer trained on CHILDES with 5 random seeds, and defer broader generalization to the discussion.
(a) Saliency of layer-wise attention from environmental to linguistic tokens across training steps.
(b) Layer-wise tuned lens to predict the $\langle$ LAN $\rangle$ token in the match condition.
Figure 5: Mechanistic analysis over training steps on GPT-CHILDES.
5.1 The Emergence of Symbol Grounding
To provide a mechanistic account of symbol grounding, i.e., when it emerges during training and how it is represented in the network, we apply two interpretability analyses.
Saliency flow. For each layer $\ell$, we compute a saliency matrix following Wang et al. (2023): $I_{\ell}=\left|\sum_{h}A_{h,\ell}\odot\frac{\partial\mathcal{L}}{\partial A_{h,\ell}}\right|$, where $A_{h,\ell}$ denotes the attention matrix of head $h$ in layer $\ell$. Each entry of $I_{\ell}$ quantifies the contribution of the corresponding attention weight to the cross-entropy loss $\mathcal{L}$, averaged across heads. Our analysis focuses on ground-to-symbol connections, i.e., flows from environmental ground ($\langle$ ENV $\rangle$) tokens to the token immediately preceding (and predicting) their linguistic forms ($\langle$ LAN $\rangle$).
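For concreteness, a minimal sketch of this saliency computation, assuming a HuggingFace-style Transformer that returns per-layer attention tensors of shape [batch, heads, seq, seq] and that those tensors remain in the autograd graph; all names are illustrative.

```python
import torch


def layer_saliency(model, input_ids, labels):
    """Per-layer saliency matrices I_l = | sum_h A_{h,l} * dL/dA_{h,l} |."""
    out = model(input_ids, labels=labels, output_attentions=True)
    loss, attentions = out.loss, out.attentions        # one [B, H, T, T] tensor per layer
    grads = torch.autograd.grad(loss, attentions)       # dL/dA for every layer
    saliencies = []
    for A, dA in zip(attentions, grads):
        # Sum over heads, take the absolute value, average over the batch.
        saliencies.append((A * dA).sum(dim=1).abs().mean(dim=0))
    return saliencies                                    # list of [T, T] matrices
```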
Probing with the Tuned Lens. We probe layer-wise representations using the Tuned Lens (Belrose et al., 2023), which trains affine projectors to map intermediate activations to the final prediction space while keeping the LM output head frozen.
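Below is a hedged, simplified sketch of such a probe: one affine translator per layer maps intermediate hidden states into the space expected by the frozen output head, and only the translators are trained. Initializing the translators near the identity is an assumption here; the full Tuned Lens recipe differs in details.

```python
import torch
import torch.nn as nn


class TunedLensProbe(nn.Module):
    """One affine translator per layer; the LM output head stays frozen."""

    def __init__(self, num_layers: int, d_model: int, unembed: nn.Module):
        super().__init__()
        self.translators = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(num_layers)
        )
        for t in self.translators:            # start near the identity map (assumption)
            nn.init.eye_(t.weight)
            nn.init.zeros_(t.bias)
        self.unembed = unembed                # LM output head, kept frozen
        for p in self.unembed.parameters():
            p.requires_grad = False

    def forward(self, hidden_states: torch.Tensor, layer: int) -> torch.Tensor:
        """Map layer-`layer` hidden states to next-token logits."""
        return self.unembed(self.translators[layer](hidden_states))
```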
Results. Ground-to-symbol saliency is weak in the early stages of training but rises sharply later, peaking in layers 7-9 (Figure 5(a)), suggesting that mid-layer attention plays a central role in establishing symbol-ground correspondences. In addition, Figure 5(b) shows that early layers remain poor predictors even at late training stages (e.g., after 20,000 steps), whereas surprisal begins to drop markedly from layer 7 at intermediate stages (step 10,000), suggesting a potential representational shift in the middle layers.
5.2 Hypothesis: Gather-and-Aggregate Heads Implement Symbol Grounding
Building on these results, we hypothesize that specific Transformer heads in the middle layers enable symbol grounding. To test this, we examine attention saliencies for selected heads (Figure 6). We find that several heads exhibit patterns consistent with the gather and aggregate mechanisms described by Bick et al. (2025): gather heads (e.g., Figures 6(a) and 6(b)) compress relevant information into a subset of positions, while aggregate heads (e.g., Figures 6(c) and 6(d)) redistribute this information to downstream tokens. In our setups, saliency often concentrates on environmental tokens such as train$_{\langle\text{ENV}\rangle}$, where gather heads pool contextual information into compact, retrievable states. In turn, aggregate heads broadcast this information from the environmental ground (train$_{\langle\text{ENV}\rangle}$) to the token immediately preceding the linguistic form, thereby supporting the prediction of train$_{\langle\text{LAN}\rangle}$. Taking these observations together, we hypothesize that the gather-and-aggregate heads implement the symbol grounding mechanism.
(a) Gather: L4 H7.
(b) Gather: L4 H8.
(c) Aggregate: L7 H5.
<details>
<summary>x21.png Details</summary>

### Visual Description
Attention heatmap for the head at layer 8, head 5, over the same token sequence (`<CHI>` "saw a train passing by" `<CHI>` "i want to ride that"), with the `<CHI>`, `<ENV>`, and `<LAN>` groups marked by curly brackets and a dark-purple-to-yellow color scale. The brightest cells (approximately 0.9) sit on the two `<CHI>` tokens and on "train"; moderate cells (approximately 0.5) link "saw" and `<CHI>`, "train" and "passing", "i" and "want", "want" and "to", and "ride" and "that"; all other cells are close to zero, yielding a clear separation between the three categories.
</details>
(d) Aggregate: L8 H5.
Figure 6: Examples of gather and aggregate heads identified in GPT-CHILDES. L: layer; H: head.
Table 2: Causal intervention results on identified gather and aggregate heads across training checkpoints (ckpt.). Avg. Count denotes the average number of heads of each type over inference times, and Avg. Layer denotes the average layer index where they appear. Interv. Sps. reports surprisal after zeroing out the identified heads, while Ctrl. Sps. reports surprisal after zeroing out an equal number of randomly selected heads. Original refers to the baseline surprisal without any intervention. *** indicates a significant result ($p<0.001$) where the intervention surprisal is higher than that in the corresponding control experiment.
| Ckpt. | Gather: Avg. Count | Gather: Avg. Layer | Gather: Interv. Sps. | Gather: Ctrl. Sps. | Aggregate: Avg. Count | Aggregate: Avg. Layer | Aggregate: Interv. Sps. | Aggregate: Ctrl. Sps. | Original |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | 0.00 | - | - | - | 0.07 | 8.74 | 9.34 | 9.34 | 9.34 |
| 5000 | 0.35 | 3.32 | 6.37 | 6.38 | 2.28 | 7.38 | 6.51 (***) | 6.39 | 6.38 |
| 10000 | 3.26 | 3.67 | 5.25 | 5.32 | 5.09 | 7.28 | 5.86 (***) | 5.29 | 5.30 |
| 20000 | 5.76 | 3.59 | 4.69 | 4.79 | 6.71 | 7.52 | 5.62 (***) | 4.76 | 4.77 |
5.3 Causal Interventions of Attention Heads
We then conduct causal interventions on attention heads to validate the hypothesis above.
Operational definition. We identify attention heads as gather or aggregate using the following criteria (a code sketch follows the definitions):
- Gather head: An attention head is classified as a gather head if at least 30% of its total saliency flows from the preceding tokens into the environmental ground token.
- Aggregate head: An attention head is classified as an aggregate head if at least 30% of its total saliency flows from the environmental ground token to the token immediately preceding the corresponding linguistic token.
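As a concrete illustration, the sketch below operationalizes the two criteria over a per-head saliency matrix; the tensor layout and function names are illustrative assumptions, not the exact implementation.

```python
import torch

def classify_head(saliency, env_idx, pred_idx, threshold=0.30):
    """Classify one attention head from its saliency matrix.

    Assumed layout: saliency[i, j] is the saliency flowing from source
    token j to target token i. env_idx is the position of the
    environmental ground token; pred_idx is the position of the token
    immediately preceding the corresponding linguistic token.
    """
    total = saliency.sum()
    if total == 0:
        return None
    # Gather: saliency flowing from the preceding tokens into <ENV>.
    gather_frac = saliency[env_idx, :env_idx].sum() / total
    # Aggregate: saliency flowing from <ENV> to the pre-linguistic position.
    aggregate_frac = saliency[pred_idx, env_idx] / total
    if gather_frac >= threshold:
        return "gather"
    if aggregate_frac >= threshold:
        return "aggregate"
    return None
```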
Causal intervention methods. In each context, we apply causal interventions to the identified head types and their corresponding controls. Following Bick et al. (2025), interventions are implemented by zeroing out the outputs of the identified heads. For the control, we mask an equal number of randomly selected heads in each layer, ensuring that they do not overlap with the identified gather or aggregate heads. A sketch of both steps follows.
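A minimal sketch of the two ingredients, assuming access to the per-head attention outputs before they are concatenated and projected (the exact hook point depends on the architecture) and a dictionary mapping layers to identified head indices:

```python
import random
import torch

def ablate_heads(per_head_out, head_indices):
    """Zero out the context vectors of the selected heads.

    per_head_out: (batch, num_heads, seq_len, head_dim) tensor, taken
    before the attention output projection (an assumption about the
    model internals).
    """
    out = per_head_out.clone()
    out[:, head_indices] = 0.0  # remove the selected heads' contribution
    return out

def sample_control_heads(identified, num_layers, num_heads, seed=0):
    """Per layer, sample as many random heads as were identified there,
    excluding the identified gather/aggregate heads themselves."""
    rng = random.Random(seed)
    controls = {}
    for layer in range(num_layers):
        banned = identified.get(layer, set())
        pool = [h for h in range(num_heads) if h not in banned]
        controls[layer] = set(rng.sample(pool, len(banned)))
    return controls
```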
| Thres. | Ckpt. | Aggregate: Avg. Count | Aggregate: Avg. Layer | Aggregate: Interv. Sps. | Aggregate: Ctrl. Sps. | Original |
| --- | --- | --- | --- | --- | --- | --- |
| 70% | 20k | 32.30 | 7.78 | 9.96 | 9.95 | 9.21 |
| 70% | 100k | 35.63 | 7.71 | 9.42 (***) | 8.84 | 8.24 |
| 70% | 200k | 34.99 | 7.80 | 8.95 (***) | 8.15 | 7.76 |
| 70% | 300k | 34.15 | 7.76 | 8.96 (***) | 8.11 | 7.69 |
| 90% | 20k | 10.66 | 8.33 | 9.51 (***) | 9.43 | 9.21 |
| 90% | 100k | 13.90 | 8.26 | 8.95 (***) | 8.50 | 8.24 |
| 90% | 200k | 13.47 | 8.46 | 8.41 (***) | 7.88 | 7.76 |
| 90% | 300k | 12.73 | 8.42 | 8.40 (***) | 7.87 | 7.69 |
<details>
<summary>x22.png Details</summary>

### Visual Description
Heatmap of attention saliency across layers and training steps (color scale from 0, dark purple, to 0.008, light yellow), with layers 1-12 on the x-axis and training steps from 30k to 300k (in 30k increments) on the y-axis. Layers 1-4 remain near zero throughout training; from layer 5 onward, saliency grows with both depth and training steps, peaking around 0.008 in layers 9-12 between roughly 210k and 240k steps before plateauing or slightly decreasing.
</details>
Figure 7: Mechanistic analysis in the image-grounded visual dialogue setting. Left: Causal intervention results on identified aggregate heads across training checkpoints, where intervention on aggregate heads consistently yields significantly higher surprisal ($p<0.001$, ***) than in the control group. Right: Saliency of layer-wise attention from environmental tokens (i.e., image tokens corresponding to patches within the bounding boxes of the target object) to linguistic tokens across training steps.
Results and discussions. As training progresses, the number of both gather and aggregate heads increases (Table 2), suggesting that these mechanisms emerge over the course of learning. Causal interventions reveal a clear dissociation: zeroing out aggregate heads consistently produces significantly higher surprisal than the controls, whereas interventions on gather heads have no such effect. This asymmetry suggests that gather heads play a less critical role in our setting, where the input template is semantically light and the environmental evidence alone suffices to shape the linguistic form. Layer-wise patterns further support this division of labor: gather heads cluster in shallow layers (3-4), while aggregate heads concentrate in middle layers (7-8). This resonates with our earlier probing results, where surprisal reductions became prominent only from layers 7-9. Together, these findings identify aggregate heads in the middle layers as the primary mechanism of grounding in the model.
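For reference, the per-checkpoint significance tests in Table 2 can be reproduced with a one-sided paired test over per-context surprisals; the paired t-test below is one plausible choice, not necessarily the exact test used for the *** markers.

```python
from scipy import stats

def intervention_significance(interv_sps, ctrl_sps):
    """One-sided paired t-test: is surprisal under head ablation
    significantly higher than under random-head (control) ablation?

    interv_sps, ctrl_sps: per-context surprisal values measured on the
    same contexts under the two interventions.
    """
    result = stats.ttest_rel(interv_sps, ctrl_sps, alternative="greater")
    return result.statistic, result.pvalue  # significant at p < 0.001 -> ***
```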
5.4 Generalization to Visual Dialog with Images
We also conduct causal interventions on attention heads in the VLM to further validate the hypothesis above.
Operational definition. We identify attention heads as aggregate using the following criterion (we do not define gather heads in this setting): an attention head is classified as an aggregate head if at least a given threshold (70% or 90% in our experiments) of its total image-patch-to-end saliency flows from the patches inside the bounding box to the token immediately preceding the corresponding linguistic token.
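A minimal sketch of this criterion, assuming a per-patch saliency vector to the pre-linguistic position and a boolean in-box mask (both names are illustrative):

```python
import torch

def aggregate_fraction_vlm(saliency_to_pred, inbox_mask):
    """Fraction of image-patch-to-prediction saliency contributed by
    patches inside the target object's bounding box.

    saliency_to_pred: (num_patches,) saliency from each image patch to
        the token immediately preceding the linguistic token.
    inbox_mask: (num_patches,) bool tensor, True for patches in the box.
    """
    return (saliency_to_pred[inbox_mask].sum() / saliency_to_pred.sum()).item()

# A head counts as "aggregate" if this fraction exceeds the chosen
# threshold (0.7 or 0.9 in our settings).
```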
Causal intervention methods. In each context, we apply causal interventions to the identified heads and their corresponding controls in the language backbone of the model. Similar to Section 5.3, interventions are implemented by zeroing out a head's outputs. For the control, we mask an equal number of randomly selected heads in each layer, ensuring that they do not overlap with the identified aggregate heads.
Results and discussions. As training progresses, the number of aggregate heads first increases and then levels off (Figure 7), suggesting that this mechanism emerges over the course of learning. Causal interventions reveal that zeroing out aggregate heads consistently produces significantly higher surprisal than the controls. The average layer of these heads also aligns with the saliency heatmap, shown on the right of Figure 7.
6 Discussions
Generalization to full-scale VLMs. As an additional case study, we extend our grounding-as-aggregation hypothesis to a full-scale VLM, LLaVA-1.5-7B (Liu et al., 2023). Even in this heavily engineered architecture, we identify many attention heads exhibiting aggregation behavior consistent with our earlier findings (Figure 1(b)), reinforcing the view that symbol grounding arises from specialized heads. At the same time, full-scale VLMs present additional complications. Models like LLaVA use multiple sets of visual tokens, including CLIP-derived embeddings that already encode language priors, and global information may be stored in redundant artifact tokens rather than object-centric regions (Darcet et al., 2024). Moreover, the large number of visual tokens (environmental tokens, in our setup) substantially increases both the computational cost and the difficulty of isolating genuine aggregation heads. While our case study thus highlights promising evidence of grounding heads in modern VLMs, their systematic detection and causal evaluation at scale remain an open challenge. Future work will need to develop computationally viable methods for (i) automatically detecting aggregation heads across diverse VLMs, and (ii) applying causal interventions to validate their role in grounding. Addressing these challenges will be crucial for moving from anecdotal case studies to a more principled understanding of grounding in modern VLMs.
The philosophical roots of grounding, revisited. Our findings highlight the need to sharpen the meaning of grounding in multimodal models. Prior work has often equated grounding with statistical correlations between visual and textual signals, such as attention overlaps or geometric alignments (Bousselham et al., 2024; Cao et al., 2025; Schnaus et al., 2025). While informative, such correlations diverge from the classic formulation by Harnad (1990), which requires symbols to be causally anchored to their referents in the environment. At the other extreme, Gubelmann (2024) argued that the symbol grounding problem does not apply to LLMs as they "are connectionist, statistical devices that have no intrinsic symbolic structure." In contrast, we discover emergent symbolic structure as an intrinsic mechanistic property: one that can be traced along training, observed in the specialization of attention heads, and validated through causal interventions. This provides not only a practical diagnostic protocol that reveals when and how models genuinely tie symbols to meaning beyond surface-level correlations, but also challenges the view that grounding is philosophically irrelevant to systems without explicit symbolic structure.
Practical implications for LM hallucinations. Our findings have practical implications for improving the reliability of LM outputs: by identifying the aggregation heads that mediate grounding between environmental and linguistic tokens, we provide a promising mechanism for assessing model reliability before generation. They also point to a pathway for mitigating hallucinations through attention control: many hallucination errors stem from misallocated attention in intermediate layers (Jiang et al., 2025; Chen et al., 2024b). Such attention-level signals can serve as early indicators of overtrust or false grounding, motivating practical solutions such as decoding-time strategies to mitigate and eventually prevent hallucination (Huang et al., 2024).
Acknowledgement
This work was supported in part by NSF IIS-1949634, NSF SES-2128623, NSERC RGPIN-2024-04395, the Weinberg Cognitive Science Fellowship to ZM, a Vector Scholarship to XL, and a Canada CIFAR AI Chair award to FS. The authors would like to thank Songlin Yang and Jing Ding for their valuable feedback.
References
- Anthropic (2024) Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. URL https://www.anthropic.com/news/claude-3-family.
- Arora et al. (2025) Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, and Christopher Potts. Mechanistic evaluation of transformers and state space models. arXiv preprint arXiv:2505.15105, 2025.
- Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
- Bick et al. (2025) Aviv Bick, Eric P. Xing, and Albert Gu. Understanding the skill gap in recurrent models: The role of the gather-and-aggregate mechanism. In Forty-second International Conference on Machine Learning, 2025.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
- Bietti et al. (2023) Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 2023.
- Blevins et al. (2022) Terra Blevins, Hila Gonen, and Luke Zettlemoyer. Analyzing the mono- and cross-lingual pretraining dynamics of multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3575–3590, 2022.
- Bousselham et al. (2024) Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localization properties in vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3828–3837, 2024.
- Cao et al. (2025) Shengcao Cao, Liang-Yan Gui, and Yu-Xiong Wang. Emerging pixel grounding in large multimodal models without grounding supervision. In International Conference on Machine Learning, 2025.
- Chang & Bergen (2022) Tyler A Chang and Benjamin K Bergen. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16, 2022.
- Chang et al. (2024) Tyler A Chang, Zhuowen Tu, and Benjamin K Bergen. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability. Transactions of the Association for Computational Linguistics, 12:1346–1362, 2024.
- Chen et al. (2024a) Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv preprint arXiv:2406.16866, 2024a.
- Chen et al. (2023) Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- Chen et al. (2024b) Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision language models. Advances in Neural Information Processing Systems, 37:44393–44418, 2024b.
- Clark (1995) Eve V Clark. The lexicon in acquisition. Number 65. Cambridge University Press, 1995.
- Comanici et al. (2025) Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- Dao & Gu (2024) Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, pp. 10041–10071. PMLR, 2024.
- Darcet et al. (2024) Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024.
- Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 326–335, 2017.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
- Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
- Evanson et al. (2023) Linnea Evanson, Yair Lakretz, and Jean-Rémi King. Language acquisition: do children and language models follow similar learning stages? In Findings of the Association for Computational Linguistics: ACL 2023, pp. 12205–12218, 2023.
- Fazly et al. (2010) Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063, 2010.
- Fenson et al. (2006) Larry Fenson, Virginia A Marchman, Donna J Thal, Phillip S Dale, J Steven Reznick, and Elizabeth Bates. Macarthur-bates communicative development inventories. PsycTESTS Dataset, 2006.
- Gleitman & Landau (1994) Lila R Gleitman and Barbara Landau. The acquisition of the lexicon. MIT Press, 1994.
- Goodman et al. (2007) Noah Goodman, Joshua Tenenbaum, and Michael Black. A bayesian framework for cross-situational word-learning. Advances in neural information processing systems, 20, 2007.
- Gubelmann (2024) Reto Gubelmann. Pragmatic norms are all you need – why the symbol grounding problem does not apply to llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11663–11678, 2024.
- Hagendorff (2023) Thilo Hagendorff. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv preprint arXiv:2303.13988, 2023.
- Harnad (1990) Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Huang et al. (2024) Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13418–13427, 2024.
- Jiang et al. (2025) Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25004–25014, 2025.
- Kangaslahti et al. (2025) Sara Kangaslahti, Elan Rosenfeld, and Naomi Saphra. Hidden breakthroughs in language model training. arXiv preprint arXiv:2506.15872, 2025.
- Li et al. (2022) Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975, 2022.
- Lieberum et al. (2023) Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in chinchilla. arXiv preprint arXiv:2307.09458, 2023.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
- Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in neural information processing systems, volume 36, pp. 34892–34916, 2023.
- Lu et al. (2024) Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, and Iryna Gurevych. Are emergent abilities in large language models just in-context learning? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5098–5139, 2024.
- Ma et al. (2023) Ziqiao Ma, Jiayi Pan, and Joyce Chai. World-to-words: Grounded open vocabulary acquisition through fast mapping in vision-language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 524–544, 2023.
- Ma et al. (2025) Ziqiao Ma, Zekun Wang, and Joyce Chai. Babysit a language model from scratch: Interactive language learning by trials and demonstrations. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 991–1010, 2025.
- MacWhinney (2000) Brian MacWhinney. The childes project: Tools for analyzing talk: Volume i: Transcription format and programs, volume ii: The database, 2000.
- Mao et al. (2019) Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, sentences from natural supervision. International Conference on Learning Representations (ICLR), 2019.
- Mao et al. (2021) Jiayuan Mao, Freda H. Shi, Jiajun Wu, Roger P. Levy, and Joshua B. Tenenbaum. Grammar-based grounded lexicon learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021.
- Meng et al. (2022) Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
- Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- OpenAI (2024) OpenAI. Hello gpt-4o, May 2024. URL https://openai.com/index/hello-gpt-4o/.
- Oquab et al. (2024) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, pp. 1–31, 2024.
- Peng et al. (2024) Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024.
- Pratt et al. (2020) Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, and Aniruddha Kembhavi. Grounded situation recognition. In European Conference on Computer Vision, pp. 314–332. Springer, 2020.
- Qu & Chai (2010) Shaolin Qu and Joyce Yue Chai. Context-based word acquisition for situated dialogue in a virtual world. Journal of Artificial Intelligence Research, 37:247–277, 2010.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
- Rasheed et al. (2024) Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Regier (2005) Terry Regier. The emergence of words: Attentional learning in form and meaning. Cognitive science, 29(6):819–865, 2005.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Roy & Pentland (2002) Deb K Roy and Alex P Pentland. Learning words from sights and sounds: A computational model. Cognitive science, 26(1):113–146, 2002.
- Sabet et al. (2020) Masoud Jalili Sabet, Philipp Dufter, François Yvon, and Hinrich Schütze. Simalign: High quality word alignments without parallel training data using static and contextualized embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020.
- Schaeffer et al. (2023) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36, 2023.
- Schnaus et al. (2025) Dominik Schnaus, Nikita Araslanov, and Daniel Cremers. It's a (blind) match! Towards vision-language correspondence without parallel data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24983–24992, 2025.
- Sellam et al. (2021) Thibault Sellam, Steve Yadlowsky, Ian Tenney, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Raluca Turc, Jacob Eisenstein, et al. The multiberts: Bert reproductions for robustness analysis. In International Conference on Learning Representations, 2021.
- Shi et al. (2021) Haoyue Shi, Luke Zettlemoyer, and Sida I. Wang. Bilingual lexicon induction via unsupervised bitext construction and word alignment. In ACL, 2021.
- Siskind (1996) Jeffrey Mark Siskind. A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):39–91, 1996.
- van der Wal et al. (2025) Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. Polypythias: Stability and outliers across fifty language model pre-training runs. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025), pp. 1–25, 2025.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wang et al. (2023) Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9840–9855, 2023.
- Wang et al. (2024) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. Cogvlm: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475–121499, 2024.
- Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- Wiegreffe et al. (2025) Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, and Ashish Sabharwal. Answer, assemble, ace: Understanding how LMs answer multiple choice questions. In The Thirteenth International Conference on Learning Representations, 2025.
- Wu et al. (2025a) Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, 2025a.
- Wu et al. (2025b) Zhaofeng Wu, Dani Yogatama, Jiasen Lu, and Yoon Kim. The semantic hub hypothesis: Language models share semantic representations across languages and modalities. In ICML, 2025b.
- Xia et al. (2023) Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Ves Stoyanov. Training trajectories of language models across scales. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13711–13738, 2023.
- Xia et al. (2024) Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Xu & Tenenbaum (2007) Fei Xu and Joshua B Tenenbaum. Word learning as bayesian inference. Psychological review, 114(2):245, 2007.
- You et al. (2024) Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In The Twelfth International Conference on Learning Representations, 2024.
- Yu (2005) Chen Yu. The emergence of links between lexical acquisition and object categorization: A computational study. Connection science, 17(3-4):381–397, 2005.
- Yu & Ballard (2007) Chen Yu and Dana H Ballard. A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70(13-15):2149–2165, 2007.
- Yu & Siskind (2013) Haonan Yu and Jeffrey Mark Siskind. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 53–63, 2013.
- Zhang et al. (2024a) Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems, 37:71737–71767, 2024a.
- Zhang et al. (2024b) Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, and Joyce Chai. Groundhog: Grounding large language models to holistic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024b.
- Zhao et al. (2024) Rosie Zhao, Naomi Saphra, and Sham M. Kakade. Distributional scaling laws for emergent capabilities. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024.
Appendix A Dataset Details
A.1 Context Templates
We select the target tokens with the following procedure (a sketch follows the list):
1. Collect the words whose ENV and LAN frequencies are both at least 100 in the CHILDES dataset;
2. Collect the list of nouns from the CDI;
3. Take the intersection of the two lists and select the top 100 words (ranked by the frequency of their ENV token) as the target token list.
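A minimal sketch of this selection procedure, assuming frequency dictionaries for the ENV and LAN forms and a set of CDI nouns (the names are illustrative):

```python
def select_target_tokens(env_freq, lan_freq, cdi_nouns, k=100, min_freq=100):
    """Intersect frequent CHILDES words with CDI nouns and keep the
    top-k candidates ranked by ENV-token frequency."""
    frequent = {
        w for w, f in env_freq.items()
        if f >= min_freq and lan_freq.get(w, 0) >= min_freq
    }
    candidates = frequent & set(cdi_nouns)
    return sorted(candidates, key=env_freq.get, reverse=True)[:k]
```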
In CHILDES, all contexts are created with gpt-4o-mini and then verified by humans to ensure that the generated contexts are semantically light. We adopt the following prompt:
Prompt Templates for CHILDES
Given the word "{word}", create 3 pairs of sentences that follow this requirement:
1. The first sentence has a subject "The child", describing an event or situation, and has the word "{word}". Make sure to add a newline to the end of this first sentence
2. The second sentence is said by the child (only include the speech itself, don't include "the child say", etc.), and the word "{word}" also appears in the sentence said by the child. Do not add quote marks either
3. Print each sentence on one line. Do not include anything else.
4. Each sentence should be short, less than 10 words.
5. The word "{word}" in both sentence have the same meaning and have a clear indication or an implication relationship.
6. "{word}" should not appear at the first/second word of each sentence.
Generate 3 pairs of such sentences, so there should be 6 lines in total. You should not add a number. For each line, just print out the sentence.
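A sketch of how such contexts can be generated with the openai Python SDK, assuming the full template above is stored in PROMPT; the generated pairs still go through the human verification step described above.

```python
from openai import OpenAI

PROMPT = 'Given the word "{word}", create 3 pairs of sentences ...'  # full template above

def generate_contexts(word, client=None):
    """Query gpt-4o-mini for 3 (scene description, child utterance) pairs."""
    client = client or OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(word=word)}],
    )
    lines = [l for l in resp.choices[0].message.content.splitlines() if l.strip()]
    # Consecutive lines form (environment sentence, child utterance) pairs.
    return list(zip(lines[0::2], lines[1::2]))
```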
In visual dialogue (caption version and VLM version), we pre-define 10 templates for each version:
Prompt Templates for Visual Dialogue (Caption Version)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> is:<LAN> it:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> do:<LAN> you:<LAN> call:<LAN> this:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> can:<LAN> you:<LAN> name:<LAN> this:<LAN> object:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what's:<LAN> this:<LAN> called:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> this:<LAN> thing:<LAN> is:<LAN> <A> (predict [FILLER]:<LAN>)
Prompt Templates for Visual Dialogue (Caption Version) (continued)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> would:<LAN> you:<LAN> name:<LAN> this:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what's:<LAN> the:<LAN> name:<LAN> of:<LAN> this:<LAN> item:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> how:<LAN> do:<LAN> you:<LAN> identify:<LAN> this:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> what:<LAN> do:<LAN> we:<LAN> have:<LAN> here:<LAN> <A> (predict [FILLER]:<LAN>)
this:<ENV> is:<ENV> [FILLER]:<ENV> <Q> how:<LAN> do:<LAN> you:<LAN> call:<LAN> this:<LAN> object:<LAN> <A> (predict [FILLER]:<LAN>)
Prompt Templates for Visual Dialogue (VLM Version)
"<image> \nwhat is it ?", "<image> \nwhat do you call this ?", "<image> \ncan you name this object ?", "<image> \nwhat is this called ?", "<image> \nwhat this thing is ?", "<image> \nwhat would you name this ?", "<image> \nwhat is the name of this item ?", "<image> \nhow do you identify this ?", "<image> \nwhat do we have here ?", "<image> \nhow do you call this object ?"
A.2 Word Lists
CHILDES and Visual Dialog (Text Only). [box, book, ball, hand, paper, table, toy, head, car, chair, room, picture, doll, cup, towel, door, mouth, camera, duck, face, truck, bottle, puzzle, bird, tape, finger, bucket, block, stick, elephant, hat, bed, arm, dog, kitchen, spoon, hair, blanket, horse, tray, train, cow, foot, couch, necklace, cookie, plate, telephone, window, brush, ear, pig, purse, hammer, cat, shoulder, garage, button, monkey, pencil, shoe, drawer, leg, bear, milk, egg, bowl, juice, ladder, basket, coffee, bus, food, apple, bench, sheep, airplane, comb, bread, eye, animal, knee, shirt, cracker, glass, light, game, cheese, sofa, giraffe, turtle, stove, clock, star, refrigerator, banana, napkin, bunny, farm, money]
Visual Dialog (VLM). [box, book, table, toy, car, chair, doll, door, camera, duck, truck, bottle, bird, elephant, hat, bed, dog, spoon, horse, train, couch, necklace, cookie, plate, telephone, window, pig, cat, monkey, drawer, bear, milk, egg, bowl, juice, ladder, bus, food, apple, sheep, bread, animal, shirt, cheese, giraffe, clock, refrigerator, accordion, aircraft, alpaca, ambulance, ant, antelope, backpack, bagel, balloon, barrel, bathtub, beard, bee, beer, beetle, bicycle, bidet, billboard, boat, bookcase, boot, boy, broccoli, building, bull, burrito, bust, butterfly, cabbage, cabinetry, cake, camel, canary, candle, candy, cannon, canoe, carrot, cart, castle, caterpillar, cattle, cello, cheetah, chicken, chopsticks, closet, clothing, coat, cocktail, coffeemaker, coin, cosmetics]
Appendix B Implementation Details
We outline the key implementation details in this section and provide links to the GitHub repositories:
- Model Training: https://github.com/Mars-tin/TraBank
- CHILDES Processing: https://github.com/Mars-tin/PyChildes
B.1 Checkpointing
We save 33 checkpoints in total for text-only experiments and 16 checkpoints for the VLM setting.
CHILDES and Visual Dialog (Text Only). We save the intermediate steps: [0, 150, 300, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000]
Visual Dialog (VLM). We save the intermediate steps: [10000, 20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 180000, 200000, 220000, 240000, 260000, 280000, 300000]
B.2 Training Details
For the text-only Transformer, Mamba2, and LSTM models, we randomly initialize them from scratch. The training process is conducted five times, each with a different random seed (using seeds 42, 142, 242, 342, and 442, respectively). The batch size is 16.
For VLM models, we randomly initialize the language model backbone from scratch and keep the DINOv2 vision encoder frozen. The training process is conducted five times for 300k steps, each with a different random seed (using seeds 42, 142, 242, 342, and 442, respectively).
All the models use a word-level tokenizer. A list of hyperparameters is shown below, followed by a sketch of the corresponding optimizer setup:
Transformer and LSTM Model.
- model_max_length: 512
- learning rate: 5e-5
- learning rate schedule: linear
- warmup_steps: 1000
- hidden_size: 768
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0
- batch_size: 16
- grad_clip_norm: 1.0
Mamba2 Model.
- model_max_length: 512
- learning rate: 4e-4
- learning rate schedule: linear
- warmup_steps: 2000
- hidden_size: 768
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.4
- batch_size: 16
- grad_clip_norm: 1.0
VLM Model.
- model_max_length: 1024
- learning rate: 2e-5
- learning rate schedule: cosine
- warmup_steps: 9000
- hidden_size: 768
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0
- batch_size: 16
- grad_clip_norm: 1.0
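The hyperparameters above map onto a standard AdamW setup. Below is a sketch for the text-only Transformer/LSTM configuration; swap in lr=4e-4, warmup_steps=2000, and weight_decay=0.4 for Mamba2, or a cosine schedule for the VLM.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, lr=5e-5, weight_decay=0.0,
                    warmup_steps=1000, total_steps=20000):
    """AdamW with linear warmup/decay, matching the listed betas."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=weight_decay
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler

# During training, clip gradients each step to match grad_clip_norm = 1.0:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```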
B.3 Computational Resources
Each Transformer, Mamba2, and LSTM model is trained on a single A40 GPU within 5 hours. For VLM models, training is conducted on 2 A40 GPUs over 15 hours, using a batch size of 8 per device.
Appendix C Addendum to Results
<details>
<summary>x23.png Details</summary>

### Visual Description
Line chart of saliency proportion (y-axis, 0-0.6) against training step (x-axis, 2k-20k) for the gather (teal) and aggregate (orange) heads. The gather curve rises from about 0.07 at 2k to a plateau near 0.38 by 18k; the aggregate curve rises more steeply, from about 0.11 to a plateau near 0.60 by 14k, staying above the gather curve at every step.
</details>
Figure 8: Gather and aggregate saliency proportions over time.
C.1 Behavioral Analysis
We show the complete behavioral evidence for all models in Figure 9, and co-occurrence analysis in Figure 10.
C.2 Mechanistic Analysis
After identifying the set of gather and aggregate heads for each context, we conduct an over-time analysis of the proportion of gather and aggregate saliency relative to the total saliency, as illustrated in Figure 8.
<details>
<summary>x24.png Details</summary>

### Visual Description
Line chart of surprisal (y-axis, about 5.0-12.5) against training steps (x-axis, 0-20k) for Match (blue) and Mismatch (orange). Match decreases steadily from about 7.2 to 5.2; Mismatch drops from about 10.2 to 6.8, falling fastest early in training and remaining above Match throughout.
</details>
(a) 4-layer Transformer.
<details>
<summary>x25.png Details</summary>

### Visual Description
Line chart of surprisal (y-axis, about 4.5-12.5) against training steps (x-axis, 0-20k) for Match (blue) and Mismatch (orange). Match declines gradually from about 5.2 to 4.7; Mismatch falls sharply from about 11.5 to around 7.5 within the first 5k steps, then plateaus near 7.0, staying well above Match for the entire run.
</details>
(b) 12-layer Transformer.
<details>
<summary>x26.png Details</summary>

### Visual Description
\n
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch" conditions. Both series show a decreasing trend in surprisal as training steps increase, suggesting a learning or adaptation process.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 5 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner of the chart.
* "Match" - represented by a dark blue line.
* "Mismatch" - represented by a orange line.
### Detailed Analysis
**Match (Dark Blue Line):**
The "Match" line starts at approximately 7.2 surprisal at 0 training steps. It exhibits a generally downward trend, with some fluctuations.
* At 0 training steps: ~7.2 surprisal
* At 5000 training steps: ~6.0 surprisal
* At 10000 training steps: ~5.5 surprisal
* At 15000 training steps: ~5.2 surprisal
* At 20000 training steps: ~5.0 surprisal
**Mismatch (Orange Line):**
The "Mismatch" line begins at approximately 11.0 surprisal at 0 training steps. It also shows a decreasing trend, but it plateaus at a higher surprisal level than the "Match" line.
* At 0 training steps: ~11.0 surprisal
* At 5000 training steps: ~7.5 surprisal
* At 10000 training steps: ~7.0 surprisal
* At 15000 training steps: ~6.8 surprisal
* At 20000 training steps: ~6.6 surprisal
### Key Observations
* Both "Match" and "Mismatch" surprisal values decrease with increasing training steps, indicating that the model is learning to better predict or represent the data in both conditions.
* The "Mismatch" condition consistently exhibits higher surprisal values than the "Match" condition across all training steps. This suggests that the model finds the "Mismatch" condition more unexpected or difficult to predict.
* The rate of decrease in surprisal appears to slow down as training progresses for both conditions, indicating diminishing returns from further training.
### Interpretation
This chart tracks the surprisal of a next-token prediction model under the "Match" (experimental) and "Mismatch" (control) conditions. Surprisal measures how unexpected the model finds a particular target, so the decreasing values indicate that the model grows more confident in its predictions as training proceeds.
The consistently higher surprisal for the "Mismatch" condition indicates that the model predicts the linguistic token less easily when the environmental token does not match it, consistent with the model exploiting a matching environmental ground to support its predictions.
The plateauing of both curves suggests that the model is approaching its asymptotic performance, so the gap between conditions is unlikely to close with further training.
</details>
(c) 18-layer Transformer.
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
This image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch". The chart appears to track the surprisal of these two conditions during a training process.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 4.5 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner.
* "Match" - represented by a blue line.
* "Mismatch" - represented by an orange line.
### Detailed Analysis
**Match (Blue Line):**
The blue line representing "Match" starts at approximately 6.0 at 0 training steps. It exhibits a steep downward trend initially, reaching a minimum of approximately 4.2 at around 5000 training steps. After this point, the line plateaus and fluctuates between approximately 4.2 and 5.5, with a slight upward trend towards the end of the observed training steps, reaching approximately 5.3 at 20000 steps.
**Mismatch (Orange Line):**
The orange line representing "Mismatch" begins at approximately 7.5 at 0 training steps. It initially decreases to a minimum of approximately 6.8 at around 2000 training steps. Subsequently, the line increases steadily, with some fluctuations, reaching approximately 8.5 at 20000 training steps.
**Data Points (Approximate):**
| Training Steps | Match (Surprisal) | Mismatch (Surprisal) |
|---|---|---|
| 0 | 6.0 | 7.5 |
| 2000 | ~5.0 | 6.8 |
| 5000 | 4.2 | ~7.2 |
| 10000 | ~4.8 | ~7.8 |
| 20000 | 5.3 | 8.5 |
### Key Observations
* The "Match" condition consistently exhibits lower surprisal values than the "Mismatch" condition throughout the training process.
* The surprisal for "Match" decreases rapidly during the initial training phase and then stabilizes.
* The surprisal for "Mismatch" increases steadily throughout the training process.
* The gap between the surprisal values of "Match" and "Mismatch" widens as training progresses.
### Interpretation
The chart suggests that training steadily reduces surprisal in the "Match" condition while, after an initial dip, surprisal in the "Mismatch" condition rises. The widening gap indicates that the model comes to rely increasingly on a matching environmental ground: prediction becomes easier when the environment agrees with the utterance and actively harder when it does not. The initial rapid decrease in "Match" surprisal suggests a period of fast learning, followed by a refinement phase in which performance plateaus, while the steady rise in "Mismatch" surprisal shows that the separation between the two conditions continues to grow throughout training.
</details>
(d) 12-layer Mamba 2.
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch" conditions. The chart appears to track the evolution of surprisal during a training process.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 4.5 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner of the chart.
* "Match" - represented by a dark blue line.
* "Mismatch" - represented by a golden-yellow line.
### Detailed Analysis
**Match (Dark Blue Line):**
The "Match" line begins at approximately 5.2 and exhibits a steep downward trend initially, decreasing rapidly to a minimum of around 4.6 at approximately 5000 training steps. After this initial drop, the line fluctuates around a value of approximately 4.6-5.0, with minor oscillations, until 20000 training steps.
**Mismatch (Golden-Yellow Line):**
The "Mismatch" line starts at approximately 7.7 and shows a slight decreasing trend initially, leveling off to a relatively stable value around 7.5-7.8. There are minor fluctuations throughout the training process, but the overall trend is relatively flat.
**Data Points (Approximate):**
| Training Steps | Match Surprisal | Mismatch Surprisal |
|----------------|-----------------|--------------------|
| 0 | 5.2 | 7.7 |
| 5000 | 4.6 | 7.6 |
| 10000 | 4.8 | 7.7 |
| 15000 | 4.7 | 7.6 |
| 20000 | 4.9 | 7.8 |
### Key Observations
* The "Match" condition exhibits a significant decrease in surprisal during the initial training phase, suggesting rapid learning or adaptation.
* The "Mismatch" condition maintains a relatively constant level of surprisal throughout the training process, indicating limited learning or adaptation.
* The "Match" surprisal consistently remains lower than the "Mismatch" surprisal across all training steps.
* The difference in surprisal between the two conditions appears to remain relatively constant after the initial drop in the "Match" condition.
### Interpretation
The chart suggests that the model rapidly learns to exploit a matching environmental ground: surprisal in the "Match" condition drops quickly and stays low, while surprisal in the "Mismatch" control remains roughly constant. Since both curves come from the same model evaluated under two conditions, the stable difference between them indicates that the benefit of a matching ground is established early in training and persists thereafter, rather than reflecting two separately trained models or differences in data quality.
</details>
(e) 4-layer Mamba 2.
<details>
<summary>x29.png Details</summary>

### Visual Description
## Line Chart: Surprisal vs. Training Steps
### Overview
The image presents a line chart illustrating the relationship between "Surprisal" (y-axis) and "Training steps" (x-axis). Two data series are plotted: one representing "Match" and the other "Mismatch". The chart appears to track the surprisal during a training process, potentially in a machine learning context.
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000. The axis is linearly scaled.
* **Y-axis:** "Surprisal", ranging from approximately 5.0 to 12.5. The axis is linearly scaled.
* **Legend:** Located in the top-right corner of the chart.
* "Match" - represented by a blue line.
* "Mismatch" - represented by an orange line.
### Detailed Analysis
The "Match" line (blue) starts at approximately 7.2 and exhibits a slow, relatively consistent downward trend, leveling off around a surprisal value of 7.0 by 20000 training steps.
The "Mismatch" line (orange) begins at approximately 12.3 and initially decreases more rapidly than the "Match" line. It reaches a minimum around 8000 training steps, with a surprisal value of approximately 7.5. After this point, the "Mismatch" line continues to decrease, but at a much slower rate, approaching a value of approximately 7.2 by 20000 training steps.
Here's a breakdown of approximate data points:
**Match (Blue Line):**
* 0 Training Steps: ~7.2 Surprisal
* 5000 Training Steps: ~7.1 Surprisal
* 10000 Training Steps: ~7.0 Surprisal
* 15000 Training Steps: ~7.0 Surprisal
* 20000 Training Steps: ~7.0 Surprisal
**Mismatch (Orange Line):**
* 0 Training Steps: ~12.3 Surprisal
* 5000 Training Steps: ~8.5 Surprisal
* 10000 Training Steps: ~7.5 Surprisal
* 15000 Training Steps: ~7.3 Surprisal
* 20000 Training Steps: ~7.2 Surprisal
### Key Observations
* Both "Match" and "Mismatch" lines demonstrate a decreasing trend in surprisal as training steps increase, indicating that the model is learning and becoming more confident in its predictions.
* The "Mismatch" line starts with a significantly higher surprisal value than the "Match" line, suggesting that initial mismatches are more surprising or less expected.
* The rate of decrease in surprisal is higher for the "Mismatch" line initially, but it slows down over time, eventually converging towards the "Match" line's surprisal level.
* The "Mismatch" line never falls below the "Match" line, suggesting that mismatches consistently result in higher surprisal values throughout the training process.
### Interpretation
This chart tracks the surprisal of the linguistic token under the "Match" (experimental) and "Mismatch" (control) conditions. Both curves decrease as the model learns to predict the sequence, but unlike in the Transformer and Mamba panels, they nearly converge: by 20000 steps the "Mismatch" surprisal (~7.2) sits only slightly above the "Match" surprisal (~7.0).
This near-convergence suggests that the LSTM derives little lasting benefit from a matching environmental ground; it predicts the linguistic form about equally well whether or not the environment agrees with the utterance. The small residual gap could reflect inherent noise in the data or limits of the model's capacity, but the overall pattern points to a much weaker grounding effect in this architecture than in the others.
</details>
(f) 4-layer LSTM.
Figure 9: Average surprisal of the experimental and control conditions over training steps.
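For reference, the "surprisal" plotted in all six panels above is the standard negative log-probability of the target token given its context, $-\log_2 p(w \mid \text{context})$, measured in bits. Below is a minimal sketch of how the match/mismatch comparison could be computed; it assumes a model that returns next-token logits, and the names `avg_surprisal`, `eval_pairs`, `match_pairs`, and `mismatch_pairs` are illustrative rather than taken from the paper.

```python
import math

import torch


def avg_surprisal(model, eval_pairs):
    """Average surprisal, in bits, of the target linguistic token:
    -log2 p(token | context), averaged over evaluation contexts.

    Assumes `model` maps token ids of shape (batch, seq_len) to
    next-token logits of shape (batch, seq_len, vocab)."""
    total, count = 0.0, 0
    with torch.no_grad():
        for context_ids, target_id in eval_pairs:
            logits = model(context_ids.unsqueeze(0))[0, -1]       # logits at the final position
            log_p = torch.log_softmax(logits, dim=-1)[target_id]  # log p(target | context), in nats
            total += -log_p.item() / math.log(2)                  # convert nats to bits
            count += 1
    return total / count


# The vertical gap between the two curves in each panel corresponds to
# avg_surprisal(model, mismatch_pairs) - avg_surprisal(model, match_pairs).
```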
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: Training Performance Metrics
### Overview
The image presents a line chart illustrating the relationship between training steps and two performance metrics: Information Gain and R² value. The chart displays how these metrics evolve during the training process, likely of a machine learning model. The chart uses a dual y-axis to accommodate the different scales of the two metrics.
### Components/Axes
* **X-axis:** "Training steps" ranging from approximately 0 to 20000.
* **Left Y-axis:** "R² values" ranging from 0 to 0.8.
* **Right Y-axis:** "Information gain" ranging from 0 to 6.
* **Legend:** Located in the top-left corner, identifying two lines:
* "Information gain" (dark blue line)
* "R² value" (orange line)
### Detailed Analysis
**Information Gain (Dark Blue Line):**
The Information Gain line starts at approximately 0 at 0 training steps. It exhibits a generally upward trend, increasing at a decreasing rate, and plateaus around a value of approximately 2.3 at 20000 training steps. The line is relatively smooth with no significant oscillations.
* At 0 training steps: ~0
* At 2000 training steps: ~0.6
* At 5000 training steps: ~1.3
* At 10000 training steps: ~1.8
* At 15000 training steps: ~2.1
* At 20000 training steps: ~2.3
**R² Value (Orange Line):**
The R² value line begins at approximately 0 at 0 training steps. It rapidly increases to a peak of approximately 0.4 at around 2000-3000 training steps. After the peak, it declines, oscillating between approximately 0.15 and 0.25, and ends at approximately 0.15 at 20000 training steps.
* At 0 training steps: ~0
* At 2000 training steps: ~0.38
* At 4000 training steps: ~0.3
* At 6000 training steps: ~0.25
* At 8000 training steps: ~0.2
* At 10000 training steps: ~0.18
* At 12000 training steps: ~0.22
* At 14000 training steps: ~0.17
* At 16000 training steps: ~0.19
* At 20000 training steps: ~0.15
### Key Observations
* The R² value initially increases rapidly but then decreases and stabilizes at a relatively low value, suggesting that co-occurrence statistics explain progressively less of the variance in the information gain as training progresses.
* The information gain consistently increases throughout the training process, indicating that the benefit of a matching environmental ground keeps growing.
* The two metrics exhibit contrasting trends: while the information gain continues to rise, the R² value peaks early and then declines.
### Interpretation
Read against the figure caption, the information gain measures how much a matching environmental token reduces the surprisal of the linguistic token, and the R² value measures how well env-lan co-occurrence counts predict that gain. The chart therefore suggests that grounding strengthens steadily over training while its dependence on raw co-occurrence weakens: after an early peak near 0.4, co-occurrence explains only about 15% of the variance in the gain by 20000 steps. The divergence between the two curves suggests that the grounding effect increasingly decouples from simple co-occurrence statistics rather than being reducible to them.
</details>
(a) 4-layer Transformer.
<details>
<summary>x31.png Details</summary>

### Visual Description
## Line Chart: Training Performance Metrics
### Overview
This image presents a line chart illustrating the relationship between training steps and two performance metrics: Information Gain and R² value. The chart displays how these metrics evolve during the training process, likely of a machine learning model. The x-axis represents "Training steps", while the left y-axis represents "R² values" and the right y-axis represents "Information gain".
### Components/Axes
* **X-axis:** "Training steps", ranging from approximately 0 to 20000.
* **Left Y-axis:** "R² values", ranging from 0.0 to 0.8.
* **Right Y-axis:** "Information gain", ranging from 0 to 6.
* **Legend:** Located in the top-left corner, containing two entries:
* "Information gain" - represented by a dark blue line.
* "R² value" - represented by a light orange line.
### Detailed Analysis
* **Information Gain (Blue Line):** The blue line representing Information Gain exhibits an upward trend throughout the training process. It starts at approximately 0 at 0 training steps, gradually increases, and plateaus around a value of approximately 2.5 at 20000 training steps. The slope of the line decreases as training progresses, indicating diminishing returns in information gain.
* **R² Value (Orange Line):** The orange line representing the R² value shows a more complex pattern. It begins at approximately 0 at 0 training steps, rapidly increases to a peak of around 0.45 at approximately 5000 training steps, and then gradually decreases to approximately 0.25 at 20000 training steps. This suggests an initial period of rapid model improvement followed by a potential overfitting or stabilization phase.
Here's a breakdown of approximate data points:
| Training Steps | Information Gain (approx.) | R² Value (approx.) |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 2500 | 0.8 | 0.2 |
| 5000 | 1.5 | 0.45 |
| 10000 | 2.0 | 0.35 |
| 20000 | 2.5 | 0.25 |
### Key Observations
* The R² value peaks early in the training process and then declines, while the Information Gain continues to increase, albeit at a decreasing rate.
* The initial rapid increase in R² suggests a quick learning phase.
* The plateauing of the information gain indicates that the benefit of a matching environmental ground saturates as training progresses.
* The divergence between the two metrics after the initial phase suggests that the gain becomes progressively less predictable from co-occurrence statistics.
### Interpretation
As in panel (a), the information gain rises steadily while the R² between the gain and env-lan co-occurrence peaks early (around 0.45 near 5000 steps) and then declines to roughly 0.25. Read against the figure caption, this suggests that co-occurrence statistics account for much of the grounding effect early in training, but progressively less of it as training continues.
The relationship between the two metrics is not necessarily causal, but their contrasting trends indicate that the later growth of the grounding effect is not driven by surface co-occurrence alone. Monitoring both curves together therefore gives a more complete picture of the model's learning dynamics than either metric on its own.
</details>
(b) 12-layer Transformer.
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Chart: Training Performance Metrics
### Overview
This image presents a line chart illustrating the relationship between training steps and two performance metrics: Information Gain and R² value. The chart tracks these metrics during a training process, likely for a machine learning model. The x-axis represents the number of training steps, while the left y-axis represents the R² value and the right y-axis represents the Information Gain.
### Components/Axes
* **X-axis:** "Training steps" ranging from 0 to approximately 20000.
* **Left Y-axis:** "R² values" ranging from 0.0 to 0.8.
* **Right Y-axis:** "Information gain" ranging from 0 to 6.
* **Legend:** Located in the top-left corner, identifying two data series:
* "Information gain" â represented by a dark blue line.
* "R² value" â represented by an orange line.
### Detailed Analysis
**Information Gain (Dark Blue Line):**
The Information Gain line starts at approximately 0 at 0 training steps. It exhibits a generally upward trend, increasing at a decreasing rate.
* At 0 training steps: ~0.0
* At 5000 training steps: ~1.5
* At 10000 training steps: ~2.0
* At 15000 training steps: ~2.4
* At 20000 training steps: ~2.7
**R² Value (Orange Line):**
The R² value line begins at approximately 0 at 0 training steps. It initially increases rapidly, reaching a peak around 5000 training steps, then gradually declines and plateaus.
* At 0 training steps: ~0.0
* At 2500 training steps: ~0.3
* At 5000 training steps: ~0.43 (peak)
* At 7500 training steps: ~0.35
* At 10000 training steps: ~0.25
* At 15000 training steps: ~0.15
* At 20000 training steps: ~0.1
### Key Observations
* The information gain consistently increases with training steps, suggesting that the grounding effect strengthens as training proceeds.
* The R² value initially increases, peaking around 5000 steps, and then declines, suggesting that env-lan co-occurrence predicts the information gain well early in training but progressively less well afterwards.
* The peak R² value (~0.43) is substantially higher than the final value (~0.1), so most of the gain at the end of training is not explained by co-occurrence.
* The scales on the y-axes are different, which is important to note when comparing the magnitudes of the two metrics.
### Interpretation
Read against the figure caption, the chart suggests that the grounding information gain keeps growing while its correlation with env-lan co-occurrence peaks around 5000 steps and then decays toward 0.1. The divergence of the two curves after roughly 7500 steps indicates that the later growth in the gain is not accounted for by co-occurrence statistics; whatever supports grounding at this depth appears to move beyond surface-level co-occurrence as training continues. The different y-axis scales should be kept in mind when comparing the magnitudes of the two metrics.
</details>
(c) 18-layer Transformer.
<details>
<summary>x33.png Details</summary>

### Visual Description
## Line Chart: Training Performance Metrics
### Overview
This image presents a line chart illustrating the relationship between training steps and two performance metrics: Information Gain and R² value. The chart tracks these metrics during a training process, likely for a machine learning model. The x-axis represents the number of training steps, while the left y-axis represents the R² value and the right y-axis represents the Information Gain.
### Components/Axes
* **X-axis:** "Training steps" ranging from approximately 0 to 20000.
* **Left Y-axis:** "R² values" ranging from 0.0 to 0.8.
* **Right Y-axis:** "Information gain" ranging from 0 to 6.
* **Legend:** Located in the top-right corner, containing two entries:
* "Information gain" - represented by a dark blue line.
* "R² value" - represented by an orange line.
### Detailed Analysis
The chart displays two distinct lines representing the two metrics.
**Information Gain (Dark Blue Line):**
The line initially rises sharply from approximately 0 at 0 training steps, reaching a value of around 2 at approximately 2000 training steps. It then plateaus with some fluctuations, reaching a maximum value of approximately 4.4 at around 12000 training steps. The line continues to fluctuate between approximately 4.0 and 4.4 until 20000 training steps.
**R² Value (Orange Line):**
The line starts at approximately 0 at 0 training steps and increases rapidly to a peak of around 0.25 at approximately 500 training steps. It then declines to a minimum of approximately 0.05 at around 1500 training steps. After this decline, the line gradually increases, reaching a value of approximately 0.15 at 20000 training steps. The R² value exhibits significant oscillation throughout the training process.
Approximate Data Points:
| Training Steps | Information Gain | R² Value |
|---|---|---|
| 0 | 0 | 0 |
| 2000 | 2 | 0.2 |
| 5000 | 3.2 | 0.15 |
| 10000 | 4.2 | 0.1 |
| 12000 | 4.4 | 0.08 |
| 20000 | 4.1 | 0.15 |
### Key Observations
* The information gain rises sharply early in training and then stabilizes around 4.0-4.4, suggesting a strong grounding effect that is established quickly and persists.
* The R² value shows an initial increase, followed by a decrease and then a slow recovery, indicating that the correlation between the gain and env-lan co-occurrence fluctuates during training.
* The R² value remains low throughout (roughly 0.05-0.25), so co-occurrence explains only a small proportion of the variance in the information gain.
* The two metrics do not appear strongly coupled: the gain grows and stabilizes largely independently of the co-occurrence correlation.
### Interpretation
The chart suggests that the 12-layer Mamba 2 model acquires a large grounding information gain quickly, while the gain's correlation with env-lan co-occurrence stays weak and noisy throughout training. Read against the figure caption, this indicates that the grounding effect in this architecture is not well predicted by raw co-occurrence statistics at any point in training; the divergence between the curves reflects grounding decoupling from surface statistics rather than a poorly fitting model or irrelevant features.
</details>
(d) 12-layer Mamba 2.
<details>
<summary>x34.png Details</summary>

### Visual Description
\n
## Line Chart: Training Performance Metrics
### Overview
This image presents a line chart illustrating the training performance of a model, tracking both Information Gain and R² value over Training Steps. The chart displays two distinct curves, each with a shaded region representing uncertainty or variance. The chart is designed to show how these metrics evolve during the training process.
### Components/Axes
* **X-axis:** "Training steps" ranging from approximately 0 to 20000.
* **Left Y-axis:** "R² values" ranging from 0 to 0.8.
* **Right Y-axis:** "Information gain" ranging from 0 to 6.
* **Line 1 (Blue):** "Information gain" with a shaded region.
* **Line 2 (Orange):** "R² value" with a shaded region.
* **Legend:** Located in the top-right corner, clearly labeling each line.
### Detailed Analysis
**Information Gain (Blue Line):**
The blue line representing Information Gain starts at approximately 0.5 at 0 training steps and exhibits an upward trend, reaching a peak of around 4.5 at approximately 12000 training steps. After this peak, the line plateaus and fluctuates between approximately 4.0 and 4.3 until 20000 training steps. The shaded region around the line indicates variability, with the lower bound starting around 0.2 at 0 training steps and rising to approximately 3.5 at 12000 steps, then remaining around 3.7-4.0.
**R² Value (Orange Line):**
The orange line representing the R² value begins at approximately 0 at 0 training steps and rapidly increases to a peak of around 0.25 at approximately 5000 training steps. Following this peak, the line declines steadily, reaching a value of approximately 0.05 at 20000 training steps. The shaded region around the line shows variability, starting at approximately 0 at 0 training steps, peaking around 0.3 at 5000 steps, and then decreasing to approximately 0.03 at 20000 steps.
### Key Observations
* The information gain increases steeply and then plateaus around 4.0-4.5, suggesting that the grounding effect saturates after roughly 12000 training steps.
* The R² value peaks early (around 0.25 near 5000 steps) and then declines steadily toward 0.05, suggesting that co-occurrence predicts the gain less and less well.
* The shaded regions indicate run-to-run variability in both metrics, which is expected during training.
* After approximately 5000 training steps the two curves move in opposite directions: as the information gain plateaus, the co-occurrence correlation decays.
### Interpretation
Taken together with the figure caption, the chart suggests that the grounding information gain is established early and then saturates, while its correlation with env-lan co-occurrence fades. The early R² peak hints that co-occurrence statistics may scaffold the initial emergence of grounding, with the effect subsequently decoupling from them; the shaded regions highlight the stochasticity of training across runs rather than a loss of generalization ability.
</details>
(e) 4-layer Mamba 2.
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Chart: Training Performance Metrics
### Overview
This image presents a line chart illustrating the progression of two key metrics, Information Gain and R² value, during a training process, likely for a machine learning model. The x-axis represents "Training steps," while the left y-axis represents "R² values" and the right y-axis represents "Information gain." The chart displays how these metrics change as the model undergoes training.
### Components/Axes
* **X-axis:** "Training steps" ranging from approximately 0 to 20000.
* **Left Y-axis:** "R² values" ranging from 0.0 to 0.8.
* **Right Y-axis:** "Information gain" ranging from 0 to 6.
* **Legend:** Located in the top-left corner, identifying the two lines:
* "Information gain" (Blue line)
* "R² value" (Orange line)
### Detailed Analysis
* **Information Gain (Blue Line):** The blue line representing Information Gain starts at approximately 0 at 0 training steps. It exhibits a slow, relatively linear increase throughout the training process, reaching a value of approximately 1.2 at 20000 training steps.
* **R² Value (Orange Line):** The orange line representing the R² value begins at approximately 0 at 0 training steps. It shows a rapid initial increase, leveling off as training progresses. At around 5000 training steps, the R² value reaches approximately 0.4. It continues to increase, but at a diminishing rate, reaching approximately 0.52 at 20000 training steps.
### Key Observations
* The R² value rises quickly during the first 5000 training steps and continues climbing slowly to roughly 0.52, so env-lan co-occurrence predicts the information gain increasingly well.
* The information gain grows slowly and nearly linearly, reaching only about 1.2 by 20000 steps, far below the values in the Transformer and Mamba panels.
* The rate of improvement in the R² value decreases substantially after 5000 training steps.
* The two metrics are plotted on different scales, so their magnitudes should not be compared directly.
### Interpretation
In contrast to the other panels, the LSTM shows a small information gain that remains strongly correlated with env-lan co-occurrence throughout training. Read against the figure caption, this suggests that whatever modest grounding effect the LSTM exhibits is largely reducible to co-occurrence statistics, rather than decoupling from them as in the Transformer and Mamba panels. The difference in scales between the two y-axes makes direct magnitude comparisons difficult, and this reading describes the chart rather than establishing a causal relationship between the two metrics.
</details>
(f) 4-layer LSTM.
Figure 10: Grounding information gain and its correlation to the co-occurrence of linguistic and environment tokens over training steps.
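The two quantities tracked in this figure can be made concrete. Below is a minimal sketch under two assumptions drawn from the caption: that the grounding information gain is the per-word surprisal reduction from a matching environmental token, and that the R² value comes from regressing that gain on env-lan co-occurrence counts. The log scaling of the counts and all function names are illustrative choices, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def grounding_information_gain(mismatch_surprisal, match_surprisal):
    """Per-word information gain, in bits: how much a matching
    environmental token reduces the surprisal of the linguistic token.
    Both arguments are arrays of per-word average surprisals."""
    return np.asarray(mismatch_surprisal) - np.asarray(match_surprisal)


def r2_against_cooccurrence(gains, cooccurrence_counts):
    """R^2 of a linear fit predicting the per-word information gain
    from log-scaled env-lan co-occurrence counts in the training data."""
    X = np.log1p(np.asarray(cooccurrence_counts, dtype=float)).reshape(-1, 1)
    y = np.asarray(gains, dtype=float)
    return LinearRegression().fit(X, y).score(X, y)
```

Recomputing both quantities at each checkpoint and plotting them against training steps would reproduce the two curves shown in each panel, up to the assumptions noted above.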