# Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics
**Authors**:
- Kavin R V (Indian Institute of Technology Kharagpur, WB, India; kavinrv13@gmail.com)
- Pawan Goyal (Indian Institute of Technology Kharagpur, WB, India; pawang@cse.iitkgp.ac.in)
## Abstract
Standard language models employ unique, monolithic embeddings for each token, potentially limiting their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be more effectively represented through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA), as well as a biomedical domain-specific benchmark (BC5CDR) using BioBERT. Our findings demonstrate that representing tokens compositionally via ASG achieves extreme compression of embedding parameters (down to 0.4–0.5% of the original embedding parameters) while maintaining $>95\%$ task performance relative to the base model, even on generative tasks, and that this extends to both cross-lingual transfer and domain-specific settings. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a simple yet concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling compact yet semantically rich models.
## 1 Introduction
In modern language models, each token is typically represented by an individual, unique embedding. However, this approach may not be optimal, as semantically similar tokens (e.g., "mother," "mom," and their respective translations in different languages) can be assigned entirely distinct representations, potentially overlooking shared conceptual underpinnings. Recent work (Park et al., 2023, 2024; Shani et al., 2025) suggests that token representations in LLMs implicitly encode higher-level semantic regularities, often described as concepts, which may be shared across words or subwords. While these studies analyze such concepts as emergent semantic categories or directions in representation space, our work explores an explicit, compositional formulation in which tokens are represented as sequences of shared Concept Vectors. In parallel, Zhang et al. (2024) proposed concept-level representations that group semantically similar tokens using k-means. While this method achieved significant vocabulary compression with retained performance, it struggles with polysemy (e.g., "father" as a family member vs. a religious figure) and is limited to encoder-only models, because it does not explicitly predict subwords during autoregressive decoding.
<details>
<summary>images/main.png Details</summary>

Diagram with three regions, read left to right. Left: a "Word Embeddings" matrix with rows `W1` to `W10`, split column-wise into `Segment 1`, `Segment 2`, and `Segment 3`. Middle: one cluster plot per segment, giving concept vectors `c1` to `c4` (Segment 1), `c5` to `c8` (Segment 2), and `c9` to `c12` (Segment 3), which are stacked into the "New Embedding (Concept Vectors)" table. Right: the "Old Input embedding for token w6" (a single block) next to the "New Input embedding for token w6 (c2, c8, c10)", formed by concatenating one concept vector from each segment.
</details>
Figure 1: Overview of the Aggregate Semantic Grouping (ASG) method for creating compositional token embeddings. Product Quantization is applied to the original word embedding layer. Embeddings are segmented into $m$ sub-vectors. For each of the $m$ segment positions, k-means clustering is performed on the corresponding sub-vectors from all tokens to learn a codebook of $k$ Concept Vectors (centroids). The new ASG embedding layer is initialized with these learned Concept Vectors. Instead of using the original input embedding for a token ‘w’, a sequence of $m$ ConceptIDs is used to retrieve the corresponding Concept Vectors from the ASG layer, which are then concatenated to form the new representation for token ‘w’.
To address these limitations, we introduce Aggregate Semantic Grouping (ASG). ASG maintains concept-level sharing but represents tokens as sequences of ‘ConceptIDs’, thereby accumulating multiple semantic facets. This sequence-based representation is inspired by successful applications in information and generative retrieval (Wang et al., 2022; Tay et al., 2022; Zhou et al., 2022). We employ Product Quantization (PQ) (Jégou et al., 2011) to transform tokens into these ConceptID sequences, aiming to preserve each token’s uniqueness and nuances while benefiting from shared semantics.
Our primary contribution is the introduction of Aggregate Semantic Grouping (ASG), a novel method leveraging Product Quantization to represent tokens as sequences of shared ConceptIDs, thereby capturing multiple semantic facets while significantly compressing embedding layer parameters. We provide a detailed methodology for applying ASG to both encoder and encoder–decoder transformer models. Conducting experiments across diverse tasks (NLI, NER, QA) and models (mBERT, XLM-R, mT5), we demonstrate that even with extreme compression on embeddings (down to 0.4–0.5% of the original embedding parameters), ASG maintains high performance (often $>$ 95% relative to baseline) and outperforms the prior semantic grouping method (Zhang et al., 2024), including in zero-shot cross-lingual transfer scenarios. Furthermore, we extend our evaluation to a domain-specific benchmark (BC5CDR; Li et al., 2016) and a domain-specialized model (BioBERT; Lee et al., 2020), where ASG achieves similar robustness, confirming its applicability beyond general-domain tasks. The code will be available at https://github.com/KavinRV/Aggregate-Semantic-Grouping.
## 2 Aggregate Semantic Grouping (ASG)
Our approach, Aggregate Semantic Grouping, reframes token representation by learning compositional embeddings from pre-trained models.
### 2.1 Learning Concept Vectors via Product Quantization
We begin with a pre-trained word embedding matrix $E$ , where each row is a $D$ -dimensional vector for a token in a vocabulary of size $V$ . Using Product Quantization (PQ), each $D$ -dimensional embedding is first divided into $m$ distinct segments (sub-vectors), each of dimension $D/m$ . For each of these $m$ segment positions, we apply k-means clustering to the collection of all corresponding segments from every token in the vocabulary. This process yields $m$ distinct codebooks; each codebook $C_{i}$ (for $i=0,\dots,m-1$ ) contains $k$ centroids, termed Concept Vectors, specific to that segment position. Each Concept Vector is of dimension $D/m$ .
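The codebook-learning step can be sketched as follows. This is a minimal illustration, not the authors' implementation (which uses nanopq with faiss, per Appendix A): the helper names `kmeans` and `learn_codebooks`, the plain Lloyd's k-means, and the random toy matrix are all assumptions made for the example.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)

def learn_codebooks(E, m, k):
    """Split each D-dim row of E into m segments and cluster each segment position."""
    V, D = E.shape
    assert D % m == 0, "D must be divisible by m"
    segments = E.reshape(V, m, D // m)            # segments[:, i] = i-th sub-vector of every token
    codebooks = np.empty((m, k, D // m), dtype=E.dtype)
    codes = np.empty((V, m), dtype=np.int64)      # per-segment centroid index for each token
    for i in range(m):
        codebooks[i], codes[:, i] = kmeans(segments[:, i, :], k, seed=i)
    return codebooks, codes

# Toy stand-in for a pre-trained embedding matrix (V=200, D=16).
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 16)).astype(np.float32)
codebooks, codes = learn_codebooks(E, m=4, k=8)
print(codebooks.shape, codes.shape)  # (4, 8, 4) (200, 4)
```

Each of the `m` codebooks holds `k` Concept Vectors of dimension `D/m`, and `codes[t, i]` records which centroid of codebook `i` was chosen for token `t`.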
### 2.2 ASG Embedding Layer Initialization
The $m$ distinct codebooks ( $C_{0},C_{1},\dots,C_{m-1}$ ), where each codebook $C_{i}$ contains $k$ Concept Vectors of dimension $D/m$ , are concatenated to form a single, new embedding matrix $E^{\prime}$ . This matrix $E^{\prime}$ has dimensions $(m\times k)\times(D/m)$ and stores all unique Concept Vectors. Specifically, the $j$ -th Concept Vector (where $j\in[0,k-1]$ ) from the $i$ -th codebook $C_{i}$ is located at row $i\times k+j$ within $E^{\prime}$ .
Each token is then mapped to a sequence of $m$ ConceptIDs. For each of its $m$ embedding segments, the corresponding ConceptID is the specific row index in $E^{\prime}$ that stores the chosen Concept Vector for that segment. This row index is determined as $i\times k+s_{i}$ , where $i$ is the segment index (from $0$ to $m-1$ ) and $s_{i}$ is the index (from $0$ to $k-1$ ) of the selected centroid from the $i$ -th segment’s codebook. This sequence of $m$ row indices (ConceptIDs) thus identifies the set of Concept Vectors representing the token.
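The row layout of $E^{\prime}$ and the ConceptID offset $i\times k+s_{i}$ can be checked with a small sketch. The toy sizes and the random stand-ins for the learned codebooks and per-segment indices are assumptions for illustration only.

```python
import numpy as np

m, k, d = 3, 4, 2                      # toy sizes: segments, centroids per segment, D/m
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, d))  # stand-in for the learned codebooks C_0..C_{m-1}
s = rng.integers(0, k, size=(5, m))     # stand-in centroid indices s_i for 5 tokens

# Stacking the codebooks puts centroid j of codebook i at row i*k + j.
E_prime = codebooks.reshape(m * k, d)

# ConceptID for segment i is the row index i*k + s_i.
concept_ids = s + np.arange(m) * k

i, j = 2, 1
assert np.array_equal(E_prime[i * k + j], codebooks[i, j])
```

With this offset, the ConceptIDs of segment `i` always fall in the range `[i*k, (i+1)*k)`, so a single lookup table serves all `m` codebooks.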
Table 1: Evaluation results across cluster granularities for mBERT and XLM-R on multilingual benchmarks. Scores include F1, Accuracy, and relative performance (%Base). For XNLI, %Base is the accuracy relative to the base model. 40% SG: Semantic Grouping as described in Zhang et al. (2024). In the Zero-Shot setting, the models were trained on the English dataset and tested on all languages.
| Model | Embedding Params (%) | Model Params (%) | XNLI Acc | XNLI F1 | XNLI %Base | WikiANN F1 | WikiANN %Base | XNLI (Zero-Shot) Acc | XNLI (Zero-Shot) %Base | WikiANN (Zero-Shot) F1 | WikiANN (Zero-Shot) %Base |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 100.00 | 100.00 | 75.46 | 74.79 | 100.00 | 89.74 | 100.00 | 64.86 | 100.00 | 58.58 | 100.00 |
| -40% SG | 40.00 | 68.95 | 72.43 | 71.88 | 95.99 | 86.69 | 96.61 | 60.64 | 93.49 | 52.35 | 89.37 |
| -ASG( $k$ =512, $m$ =48) | 0.50 | 48.65 | 73.51 | 72.84 | 97.42 | 88.11 | 98.19 | 61.30 | 94.51 | 55.71 | 95.10 |
| XLM-R | 100.00 | 100.00 | 77.98 | 77.28 | 100.00 | 88.37 | 100.00 | 71.94 | 100.00 | 58.74 | 100.00 |
| -40% SG | 40.00 | 58.48 | 74.56 | 73.96 | 95.61 | 84.57 | 95.70 | 65.83 | 91.51 | 51.48 | 87.65 |
| -ASG( $k$ =1024, $m$ =48) | 0.40 | 31.08 | 77.06 | 76.39 | 98.81 | 86.53 | 97.92 | 68.05 | 94.59 | 54.46 | 92.72 |
### 2.3 Token Representation with ASG
When a token is processed, its pre-computed sequence of $m$ ConceptIDs is used to retrieve the corresponding $m$ Concept Vectors from their respective codebooks within $E^{\prime}$ . Let these retrieved Concept Vectors be $v_{0},v_{1},\dots,v_{m-1}$ , where each $v_{i}$ has dimension $D/m$ . The final ASG representation for the token, $e^{\prime}\in\mathbb{R}^{D}$ , is obtained by concatenating these $m$ Concept Vectors:
$$
e^{\prime}=\text{concat}(v_{0},v_{1},\dots,v_{m-1}) \tag{1}
$$
This vector $e^{\prime}$ serves as the input to subsequent layers of the model.
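Equation (1) amounts to an indexed lookup followed by a reshape. The sketch below assumes the stacked $E^{\prime}$ layout from Section 2.2; the function name `asg_embed` and the toy values are hypothetical.

```python
import numpy as np

def asg_embed(E_prime, concept_ids):
    """Look up m Concept Vectors per token and concatenate them (Eq. 1).

    E_prime:     (m*k, D/m) stacked codebooks
    concept_ids: (..., m) integer row indices into E_prime
    returns:     (..., D) token representations
    """
    vectors = E_prime[concept_ids]                         # (..., m, D/m)
    return vectors.reshape(*concept_ids.shape[:-1], -1)    # concatenate along the last axis

# Toy example with m=3, k=4, D/m=2, so E_prime has 12 rows.
E_prime = np.arange(24).reshape(12, 2)
ids = np.array([[1, 6, 10]])        # one token: a row from each of the three codebooks
e = asg_embed(E_prime, ids)
print(e[0].tolist())                # [2, 3, 12, 13, 20, 21]
```

The same function handles a batch of token sequences, since any leading dimensions of `concept_ids` are preserved.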
### 2.4 Application to Generative Models
For models with a decoder, which have separate input and output embedding layers (the latter often serving as token classifier weights), we apply the ASG process to both. This results in two distinct ASG embedding structures: one for input token representations ( $E^{\prime}$ ) and another for the output layer ( $OE^{\prime}$ ), each derived from its respective original embedding matrix.
Table 2: Evaluation results for generative models across cluster granularities for mT5 on TyDiQA and WikiANN. Separate: one codebook per segment. Shared: one codebook shared across all segments. In the Zero-Shot setting, the models were trained on the English dataset and tested on all languages.
| Model | Embedding Params (%) | Model Params (%) | TyDiQA F1 | TyDiQA EM | TyDiQA %Base | WikiANN F1 | WikiANN %Base | WikiANN (Zero-Shot) F1 | WikiANN (Zero-Shot) %Base |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mT5 | 100.00 | 100.00 | 70.74 | 56.20 | 100.00 | 84.21 | 100.00 | 50.75 | 100.00 |
| -ASG Separate ( $k=1024,m=32$ ) | 0.45 | 15.06 | 60.67 | 46.15 | 85.76 | 79.85 | 94.82 | 25.84 | 50.91 |
| -ASG Separate ( $k=2048,m=32$ ) | 0.85 | 15.41 | 63.81 | 49.06 | 90.19 | 80.93 | 96.11 | 29.85 | 58.82 |
| -ASG Separate ( $k=8192,m=32$ ) | 3.32 | 17.51 | 66.22 | 51.71 | 93.61 | 82.19 | 97.60 | 33.51 | 66.03 |
| -ASG Separate ( $k=1024,m=64$ ) | 0.45 | 15.06 | 69.96 | 55.53 | 98.89 | 83.18 | 98.78 | 44.02 | 86.74 |
| -ASG Shared ( $k=16384,m=32$ ) | 0.25 | 14.89 | 66.50 | 51.90 | 93.99 | 81.65 | 96.96 | 34.01 | 67.02 |
| -ASG Shared ( $k=32768,m=32$ ) | 0.45 | 15.06 | 67.00 | 53.06 | 94.71 | 82.04 | 97.42 | 37.01 | 72.92 |
| -ASG Shared ( $k=32768,m=64$ ) | 0.25 | 14.89 | 70.81 | 56.51 | 100.09 | 84.19 | 99.97 | 47.23 | 93.06 |
Output Logit Calculation: To compute the logit $l_{t}$ for a target token $t$ , the final hidden state $H\in\mathbb{R}^{D}$ from the model is first segmented into $m$ parts: $H=[H_{0},H_{1},\dots,H_{m-1}]$ , where each $H_{i}\in\mathbb{R}^{D/m}$ . Let the sequence of Concept Vectors for token $t$ be $u_{t,0},u_{t,1},\dots,u_{t,m-1}\in OE^{\prime}$ . The logit is calculated as:
$$
l_{t}=\sum_{i=0}^{m-1}H_{i}\cdot u_{t,i} \tag{2}
$$
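Equation (2) can be evaluated for the whole vocabulary without materializing a $V\times D$ output matrix. The sketch below is one possible implementation, not necessarily the authors': it scores every Concept Vector once against its own segment of $H$ and then sums the $m$ scores each token points to. The function name `asg_logits` and the toy sizes are assumptions.

```python
import numpy as np

def asg_logits(H, OE_prime, concept_ids, k):
    """Logits for all tokens from a hidden state H via Eq. (2).

    H:           (D,) final hidden state
    OE_prime:    (m*k, D/m) stacked output codebooks
    concept_ids: (V, m) row indices into OE_prime (i*k + s_i layout)
    """
    V, m = concept_ids.shape
    H_seg = H.reshape(m, -1)                               # H_0 .. H_{m-1}
    # Dot product of each Concept Vector with the matching segment of H: shape (m, k).
    scores = np.einsum("ikd,id->ik", OE_prime.reshape(m, k, -1), H_seg)
    # Flattened index i*k + j matches the ConceptID layout; sum the m scores per token.
    return scores.reshape(-1)[concept_ids].sum(axis=1)

# Toy check against the direct definition: l_t = H . concat(u_{t,0}, ..., u_{t,m-1}).
m, k, d, V = 3, 4, 2, 5
rng = np.random.default_rng(0)
H = rng.normal(size=m * d)
OE_prime = rng.normal(size=(m * k, d))
concept_ids = rng.integers(0, k, size=(V, m)) + np.arange(m) * k
logits = asg_logits(H, OE_prime, concept_ids, k)
direct = OE_prime[concept_ids].reshape(V, -1) @ H
assert np.allclose(logits, direct)
```

In this formulation the dot products cost $k\cdot D$ multiplications per hidden state, followed by $m$ lookups and additions per token.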
## 3 Experiments and Results
### 3.1 Datasets
We evaluate our proposed ASG method on diverse cross-lingual benchmarks for natural language inference (NLI), question answering (QA), and named entity recognition (NER). These include: XNLI (Conneau et al., 2018), a 15-language sentence understanding benchmark; the Gold Passage (GoldP) task of TyDi QA (Clark et al., 2020), an 11-language QA dataset where gold context is provided; and the XTREME benchmark version (Hu et al., 2020) of WikiANN (Pan et al., 2017), a 40-language NER dataset.
### 3.2 Settings
For $k$ , values were generally chosen as powers of two. This allowed us to systematically target specific levels of embedding parameter compression, aiming for reductions that brought the ASG embedding layer size to approximately 0.5%, 1%, and 4% of the original embedding parameters. Regarding the number of subspaces $m$ , our explorations indicated that too few subspaces (e.g., $m=16$ ) resulted in a significant degradation of model performance. Conversely, very high values of $m$ (e.g., 128, 256, or 512) would lead to extremely small dimensions for each segment ( $D/m$ , potentially as low as 4, 2, or 1 for common embedding sizes $D$ ) and would consequently require very long sequences of ConceptIDs (length $m$ ) to represent each token. These considerations led us to focus on $m$ values within a moderate range for the experiments detailed below.
### 3.3 Fine-tuning Performance
We evaluated ASG on the encoder-only mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2019) models using the XNLI and WikiANN datasets, mainly to compare its effectiveness against the Semantic Grouping (SG) method of Zhang et al. (2024). As shown in Table 1, ASG achieves significant embedding compression while maintaining over 97% of baseline performance, and notably outperforms SG even with a low $k$ value.
To assess ASG for generative tasks, we then evaluated the mT5 model (Xue et al., 2020) on the TyDiQA and WikiANN datasets, applying ASG to both its input and output embeddings. Table 2, detailing results for various cluster ( $k$ ) and subspace ( $m$ ) configurations, shows ASG consistently achieved over 85% of baseline mT5 performance. Specifically, with $k\geq 2048$ , relative performance on TyDiQA surpassed 90%, while on WikiANN, ASG configurations generally exceeded 95% of the baseline.
Furthermore, for mT5, we investigated a variant employing a single shared codebook across all $m$ subspaces. To achieve this, the $m$ segments from all token embeddings in the vocabulary are pooled together before applying k-means clustering, which yields one global codebook of Concept Vectors. Each of the $m$ ConceptIDs for a token then selects a Concept Vector from this shared codebook to represent its corresponding segment. This approach, despite reducing the diversity of available Concept Vectors, maintained over 93% relative performance on both TyDiQA and WikiANN in the fine-tuning setting (Table 2). This suggests that a highly restricted set of output Concept Vectors can still be effective for generative tasks.
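The shared-codebook variant differs from the per-segment setup only in how the k-means input is assembled. The following sketch pools all segments before clustering; as before, the plain Lloyd's k-means, the helper names, and the random toy matrix are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)

def learn_shared_codebook(E, m, k):
    """Pool the m segments of every token and learn one codebook for all positions."""
    V, D = E.shape
    assert D % m == 0, "D must be divisible by m"
    pooled = E.reshape(V * m, D // m)      # row v*m + i is segment i of token v
    codebook, assign = kmeans(pooled, k)
    return codebook, assign.reshape(V, m)  # every ConceptID indexes the same codebook

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 16)).astype(np.float32)
codebook, codes = learn_shared_codebook(E, m=4, k=16)
recon = codebook[codes].reshape(100, 16)   # concatenated shared Concept Vectors per token
print(codebook.shape, codes.shape, recon.shape)  # (16, 4) (100, 4) (100, 16)
```

Because there is a single codebook, the embedding table has `k` rows instead of `m*k`, which is why a much larger `k` fits in the same parameter budget.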
### 3.4 Domain-Specific Evaluation (BC5CDR)
| Model | Embedding Params (%) | Model Params (%) | F1 | %Base |
| --- | --- | --- | --- | --- |
| BERT-base | 100 | 100.00 | 85.58 | 100.00 |
| -40% SG | 40 | 87.64 | 81.57 | 95.32 |
| -ASG ( $k{=}128$ , $m{=}48$ ) | 0.81 | 79.50 | 83.90 | 98.03 |
| -ASG ( $k{=}512$ , $m{=}48$ ) | 2.13 | 79.77 | 85.24 | 99.60 |
Table 3: BERT-base fine-tuned on BC5CDR.
To further validate ASG in specialized settings, we evaluate on the BC5CDR Named Entity Recognition task (Li et al., 2016), a biomedical benchmark focused on identifying chemical and disease entities. This task poses strong vocabulary-specific requirements, making it a challenging testbed for compressed embeddings.
We compare BioBERT (Lee et al., 2020) and BERT-base with standard embeddings, Semantic Grouping (SG), and ASG under varying compression levels. Results are reported as F1 score and relative performance (%Base).
| Model | Embedding Params (%) | Model Params (%) | F1 | %Base |
| --- | --- | --- | --- | --- |
| BioBERT | 100 | 100.00 | 89.48 | 100.00 |
| -40% SG | 40 | 87.64 | 86.71 | 96.90 |
| -ASG ( $k{=}128$ , $m{=}48$ ) | 0.81 | 79.50 | 87.78 | 98.10 |
| -ASG ( $k{=}512$ , $m{=}48$ ) | 2.13 | 79.77 | 88.93 | 99.39 |
Table 4: BioBERT fine-tuned on BC5CDR.
Across both backbones, ASG preserves high task performance under strong embedding compression. Even at $<1\%$ of the original embedding size, ASG recovers over 98% of the base model performance. These results demonstrate that ASG generalizes beyond general-domain benchmarks to biomedical NER.
### 3.5 Cross-Lingual Transfer (Zero-Shot)
For zero-shot cross-lingual transfer, we followed the experimental setup of Zhang et al. (2024). Models were trained solely on the English XNLI and WikiANN training sets and then evaluated on the multilingual test sets of these datasets. In this setting, ASG-enhanced models outperformed the Semantic Grouping method (Table 1). While generative models using ASG with lower $k$ (clusters per segment) and $m$ (segments) values showed reduced performance in cross-lingual transfer (Table 2), configurations with $m=64$ segments nonetheless achieved at least 86% of baseline model performance. With a shared codebook, performance improved further, reaching up to 93% of the baseline with just 0.25% of the embedding parameters.
### 3.6 Qualitative Analysis
Figure 2 illustrates how Aggregate Semantic Grouping (ASG) captures varied semantic facets of the token "father" through its clustering across selected segments:
- Familial Context: "father" clusters with kinship terms such as "padre" (father), "mother", and "daughter" (Segment 2), or "barn" (child), "parent", and "grandmother" (Segment 12; also Segment 16), reflecting its primary familial sense.
- Authority/Religious Context: In Segment 0, "father" groups with "Chief", "Prophet", "notables", and "religión", indicating connotations of leadership or religious reverence.
- Figurative/Abstract Contexts: Other segments link "father" to broader concepts, such as "Zeus" (mythological father figure), or with terms like "records", "govern", and "legacy" (Segment 7), potentially reflecting historical origin, or the act of establishing something significant.
## 4 Conclusion
This work investigated equipping language models with shared, compositional token representations as an alternative to traditional monolithic embeddings. We explored this through Aggregate Semantic Grouping (ASG), where Product Quantization transforms embeddings into sequences of ConceptIDs that map to shared, learned Concept Vectors, enabling multifaceted semantic capture alongside significant compression. Extensive experiments on diverse models (including mBERT, XLM-R, and mT5) and NLU tasks (such as NLI, NER, and QA) found ASG maintains high performance (often >95% relative to baseline) despite extreme parameter reduction (to <1% of original size). ASG also outperformed prior semantic grouping methods, and proved effective for generative architectures. These findings confirm that ASG’s decomposition of tokens into shared components offers an efficient, semantically rich, and promising direction for language modeling; future work may explore dynamic or adaptive quantization techniques.
## 5 Limitations
ASG was applied directly to word embeddings from pre-trained models without an explicit cross-lingual alignment step, which could refine Concept Vector clustering. This may partly explain the observed performance degradation in generative tasks within cross-lingual settings, such as on WikiANN, where better nuance preservation through alignment-optimized clustering could be beneficial. Furthermore, we did not undertake continual pre-training of the models with the ASG embeddings; such a phase could allow models to adapt more effectively to the compositional representations and potentially enhance overall performance. Complementary to this, methods similar to GraphMerge (Wu and Monz, 2023) could potentially be combined with ASG to pre-align embeddings before clustering, leading to more coherent ConceptID assignments.
## References
- Clark et al. (2020) Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
- Conneau et al. (2019) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186.
- Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International conference on machine learning, pages 4411–4421. PMLR.
- Jégou et al. (2011) Herve Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.
- Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
- Li et al. (2016) Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. Biocreative V CDR task corpus: a resource for chemical disease relation extraction. Database J. Biol. Databases Curation, 2016.
- Pan et al. (2017) Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
- Park et al. (2024) Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. 2024. The geometry of categorical and hierarchical concepts in large language models. arXiv preprint arXiv:2406.01506.
- Park et al. (2023) Kiho Park, Yo Joong Choe, and Victor Veitch. 2023. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.
- Shani et al. (2025) Chen Shani, Dan Jurafsky, Yann LeCun, and Ravid Shwartz-Ziv. 2025. From tokens to thoughts: How llms and humans trade compression for meaning. arXiv preprint arXiv:2505.17117.
- Tay et al. (2022) Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, and 1 others. 2022. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35:21831–21843.
- Wang et al. (2022) Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, and 1 others. 2022. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems, 35:25600–25614.
- Wu and Monz (2023) Di Wu and Christof Monz. 2023. Beyond shared vocabulary: Increasing representational word similarities across languages for multilingual machine translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9749–9764, Singapore. Association for Computational Linguistics.
- Xue et al. (2020) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
- Zhang et al. (2024) Xinyu Zhang, Jing Lu, Vinh Q Tran, Tal Schuster, Donald Metzler, and Jimmy Lin. 2024. Tomato, tomahto, tomate: Measuring the role of shared semantics among subwords in multilingual language models. arXiv preprint arXiv:2411.04530.
- Zhou et al. (2022) Yujia Zhou, Jing Yao, Zhicheng Dou, Ledell Wu, Peitian Zhang, and Ji-Rong Wen. 2022. Ultron: An ultimate retriever on corpus with a model-based indexer. arXiv preprint arXiv:2208.09257.
Table 5: Model configurations and embedding parameters for ASG. Rows marked "(shared)" use a shared codebook, matching the Shared configurations in Table 2.
| Model | k | m | Parameters | Embedding Shape (Dim) |
| --- | --- | --- | --- | --- |
| mBERT | N/A | N/A | 177M | [120k, 768] |
| -ASG | 512 | 48 | 86M | [30k, 16] |
| XLM-R | N/A | N/A | 277M | [250k, 768] |
| -ASG | 1024 | 48 | 86M | [49k, 16] |
| mT5 | N/A | N/A | 300M | [256k, 512] |
| -ASG | 1024 | 32 | 45M | [36k, 16] |
| -ASG | 2048 | 32 | 46M | [68k, 16] |
| -ASG | 8192 | 32 | 53M | [265k, 16] |
| -ASG (shared) | 16384 | 32 | 44M | [20k, 16] |
| -ASG (shared) | 32768 | 32 | 45M | [36k, 16] |
| -ASG (shared) | 32768 | 64 | 44M | [39k, 8] |
| -ASG | 1024 | 64 | 45M | [72k, 8] |
## Appendix A Experimental Setup
All experiments use the smallest available checkpoint of each pre-trained model. We train with a batch size of 128, and every experiment was run on a single NVIDIA L40 GPU.
For the encoder models (mBERT and XLM-R), we set a weight decay of $0.01$. The learning rate was $5\times 10^{-6}$ for XNLI experiments and $5\times 10^{-5}$ for WikiANN experiments. These models were trained for 2 epochs; for cross-lingual transfer settings, training was extended to 5 epochs. The mT5 model was trained with a learning rate of $1\times 10^{-3}$.
Product Quantization is implemented using the nanopq library (https://github.com/matsui528/nanopq). For the k-means clustering within nanopq, we use the faiss library (https://faiss.ai/).
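To make the quantization step concrete, the following is a minimal NumPy-only sketch of Product Quantization applied to an embedding table. It is a stand-in for illustration, not the nanopq/faiss pipeline used in our experiments: the helper names (`kmeans`, `fit_pq`, `asg_embed`), the plain Lloyd's k-means, and the toy sizes are all assumptions.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (illustrative stand-in for faiss k-means)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    # Final assignment, consistent with the final centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(axis=1)

def fit_pq(emb, m, k):
    """Split each embedding into m sub-vectors and cluster each subspace."""
    V, D = emb.shape
    assert D % m == 0, "D must be divisible by m"
    sub = D // m
    codebooks = np.empty((m, k, sub), dtype=emb.dtype)  # m codebooks of k Concept Vectors
    concept_ids = np.empty((V, m), dtype=np.int64)      # token -> m ConceptIDs
    for j in range(m):
        block = emb[:, j * sub:(j + 1) * sub]
        codebooks[j], concept_ids[:, j] = kmeans(block, k, seed=j)
    return codebooks, concept_ids

def asg_embed(codebooks, concept_ids, token_id):
    """Rebuild a token embedding by concatenating its m Concept Vectors."""
    m = codebooks.shape[0]
    return np.concatenate(
        [codebooks[j, concept_ids[token_id, j]] for j in range(m)]
    )

# Toy sizes (hypothetical): V=200 tokens, D=16, m=4 subspaces, k=8 centroids.
rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 16)).astype(np.float32)
codebooks, concept_ids = fit_pq(emb, m=4, k=8)
vec = asg_embed(codebooks, concept_ids, token_id=3)
print(codebooks.shape, concept_ids.shape, vec.shape)  # (4, 8, 4) (200, 4) (16,)
```

Only `codebooks` would be trainable in this sketch; `concept_ids` plays the role of the fixed token-to-ConceptID mapping.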
## Appendix B Parameter Reduction Calculation
In a standard model, the token embedding table has shape $(V,D)$, where $V$ is the vocabulary size and $D$ is the embedding dimension.
With ASG, the embedding layer is replaced by $m$ codebooks, each with $k$ Concept Vectors of size $D/m$. The ASG embedding matrix thus has shape:
$$
(k\cdot m,\tfrac{D}{m})
$$
The number of parameters becomes:
$$
k\cdot m\cdot\tfrac{D}{m}=k\cdot D
$$
So the ratio of ASG to original embedding parameters is:
$$
\frac{k\cdot D}{V\cdot D}=\frac{k}{V}
$$
For example, in XLM-R with $V=250{,}000$, $D=768$, $k=1024$, and $m=48$, the ASG matrix has shape approximately $(49{,}000,16)$ (exactly $k\cdot m=49{,}152$ rows of size $D/m=16$), and the compression ratio is:
$$
\frac{49{,}000\cdot 16}{250{,}000\cdot 768}\approx 0.004\;(\approx 0.4\%)
$$
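The arithmetic in this appendix can be checked with a few lines of Python. This is a small sketch under the non-shared-codebook assumption; the function name `asg_embedding_stats` is hypothetical, and the values are the XLM-R figures quoted above.

```python
def asg_embedding_stats(V, D, k, m):
    """Embedding shape and parameter ratio for ASG with m separate codebooks."""
    assert D % m == 0, "D must be divisible by m"
    rows, cols = k * m, D // m      # ASG matrix shape (k*m, D/m)
    asg_params = rows * cols        # equals k * D
    base_params = V * D             # original (V, D) embedding table
    return {
        "shape": (rows, cols),
        "asg_params": asg_params,
        "base_params": base_params,
        "ratio": asg_params / base_params,  # equals k / V
    }

stats = asg_embedding_stats(V=250_000, D=768, k=1024, m=48)
print(stats["shape"])            # (49152, 16)
print(f"{stats['ratio']:.2%}")   # 0.41%
```

Because the ratio reduces to $k/V$, changing $m$ alters the matrix shape but not the parameter count in this setting.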
## Appendix C Token-to-ConceptID Mapping Matrix
Each token is mapped to a sequence of $m$ ConceptIDs (integers in $[0,k)$ ), forming a matrix of shape $(V,m)$ . This mapping is:
- Not a parameter: it is precomputed, fixed, and non-trainable.
- Stored externally and could be used during tokenization.
- Extremely compact, as it contains only small integers.
For $V=250$k and $m=48$, this mapping can be stored with an overhead of $\sim 2\%$ ($15$ MB) or $\sim 2.15\%$ ($16.5$ MB) relative to the memory required for the original embedding matrix, when $k=1024$ or $k=2048$, respectively.
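The storage figures above can be reproduced with the sketch below. It rests on assumptions that are not stated explicitly in the text: each ConceptID is bit-packed into $\lceil\log_2 k\rceil$ bits, the original embedding matrix is stored in 32-bit floats with $D=768$ (XLM-R), and 1 MB is $10^6$ bytes. The function name `mapping_overhead` is hypothetical.

```python
def mapping_overhead(V, m, k, D, bytes_per_weight=4):
    """Size of the (V, m) ConceptID table and its share of the (V, D) embedding table."""
    bits_per_id = (k - 1).bit_length()          # 10 bits for k=1024, 11 for k=2048
    mapping_bytes = V * m * bits_per_id / 8     # assumes bit-packed ConceptIDs
    embedding_bytes = V * D * bytes_per_weight  # assumes fp32 weights
    return mapping_bytes, mapping_bytes / embedding_bytes

results = {}
for k in (1024, 2048):
    size, frac = mapping_overhead(V=250_000, m=48, k=k, D=768)
    results[k] = (size, frac)
    print(k, f"{size / 1e6:.1f} MB", f"{frac:.2%}")
# 1024 15.0 MB 1.95%
# 2048 16.5 MB 2.15%
```

Storing the IDs in a standard 16-bit integer type instead of bit-packing would raise the mapping to 24 MB for either value of $k$ under the same assumptions.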
## Appendix D ASG Configuration and Embedding Parameters
Table 5 summarizes the configurations used for Aggregate Semantic Grouping (ASG) across different models, alongside details for the original base models. The table specifies the choices of $k$ (number of centroids per subspace) and $m$ (number of subspaces) for each ASG setup. It also lists the resulting total number of parameters in the model and the shape of the embedding layer.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Table: Multilingual Word Clusters by Segment
### Overview
The image displays a data table with three columns, presenting information about word clusters grouped by segment. Each row represents a distinct segment, showing the number of tokens (words) in that cluster and a list of the specific words contained within it. The words are multilingual, appearing in scripts including Latin, Chinese, Japanese, Korean, Arabic, Hebrew, Devanagari, and others.
### Components/Axes
The table has three columns:
1. **Segment (m=48)**: The leftmost column. It lists the indices of the selected subspaces (segments), which are non-sequential integers: 0, 1, 2, 7, 12, 16. The notation "(m=48)" in the header indicates that each embedding is split into m=48 subspaces.
2. **Number of tokens in Cluster**: The middle column. It provides a numerical count of the words (tokens) found in each segment's cluster.
3. **Words in Cluster**: The rightmost and widest column. It contains a comma-separated list of the actual words belonging to each cluster. The words are presented in their original scripts and are enclosed in single quotes.
### Detailed Analysis
The table contains six data rows. Below is a precise transcription of each row's content.
**Row 1 (Segment 0):**
* **Segment:** 0
* **Number of tokens:** 27
* **Words in Cluster:** `爹`, `爸`, `爹`, `father`, `paris`, `Chief`, `chief`, `Ho`, `abimóndos`, `Mother`, `old`, `##old`, `Telegraph`, `earth`, `##रू`, `notables`, `adta`, `religión`, `metan`, `Israel`, `praise`, `Prophet`, `יהוה`, `Thief`, `voter`, `Capitaine`
**Row 2 (Segment 1):**
* **Segment:** 1
* **Number of tokens:** 29
* **Words in Cluster:** `父`, `father`, `piccolo`, `Zeus`, `Castello`, `##нтр`, `##ов`, `senjata`, `##oi`, `antichi`, `iniziale`, `ىـ`, `Verenigd`, `apariencia`
**Row 3 (Segment 2):**
* **Segment:** 2
* **Number of tokens:** 22
* **Words in Cluster:** `father`, `padre`, `daughter`, `mother`, `смн`, `##ל`, `Incidente`, `baby`, `daughters`, `V6`, `grandson`, `पोता`, `afromoths`, `Michaela`, `ппопка`, `семейството`, `Mana`, `lluz`, `##म`, `Potok`, `смт`, `##ग`
**Row 4 (Segment 7):**
* **Segment:** 7
* **Number of tokens:** 19
* **Words in Cluster:** `爹`, `top`, `father`, `records`, `govern`, `##ने`, `corona`, `Hartmann`, `##čka`, `Учhuququjhú`, `Dakar`, `Pons`, `legacy`, `##čić`, `портер`, `marge`, `naturally`, `अत्याचारी`, `McDowell`
**Row 5 (Segment 12):**
* **Segment:** 12
* **Number of tokens:** 24
* **Words in Cluster:** `##ی`, `father`, `dad`, `barn`, `##ع`, `##ی`, `Infantry`, `Elementary`, `children`, `coupe`, `parent`, `##CA`, `Injuries`, `##RC`, `##GC`, `connect`, `cache`, `ك`, `kaloi`, `Lahti`, `indicato`, `seco`, `grandmother`, `wheat`
**Row 6 (Segment 16):**
* **Segment:** 16
* **Number of tokens:** 13
* **Words in Cluster:** `father`, `debut`, `mother`, `port`, `склад`, `##Dai`, `Beginning`, `uncle`, `##дим`, `Straży`, `comengament`, `##rusade`, `grandmother`
### Key Observations
1. **Multilingual Content:** The clusters contain words from numerous languages and scripts. Recognizable languages include English, Spanish (`padre`, `religión`), Italian (`piccolo`, `antichi`), Chinese (`爹`, `父`), Japanese (`ポ`), Korean (`##구`), Arabic (`ىـ`, `ك`), Hebrew (`יהוה`, `##ל`), Hindi/Devanagari (`पोता`, `अत्याचारी`), and several others using Cyrillic and Latin diacritics.
2. **Thematic Clustering:** There is a strong, recurring theme of **family and kinship terms** across multiple segments. Words like "father," "mother," "daughter," "son," "grandson," "uncle," "grandmother," "baby," "children," and "parent" appear frequently, often alongside their translations in other languages (e.g., `padre`, `爹`, `父`).
3. **Token Count Variation:** The number of tokens per cluster varies, ranging from a low of 13 (Segment 16) to a high of 29 (Segment 1).
4. **Non-Family Terms:** Each cluster also contains a mix of other, seemingly unrelated words (e.g., "Telegraph," "Zeus," "Infantry," "wheat," "cache"), suggesting the clustering algorithm may be grouping words based on contextual usage in a corpus rather than strict semantic similarity alone.
5. **Subword Tokens:** The presence of tokens starting with "##" (e.g., `##old`, `##нтр`, `##oi`) indicates these are likely subword units from a tokenization process (like Byte-Pair Encoding or WordPiece), commonly used in natural language processing models.
### Interpretation
This table shows how the token "father" is grouped by ASG in a few selected subspaces. "Segment (m=48)" refers to the index of one of the m=48 subspaces into which each embedding is split by Product Quantization; each row lists the tokens that fall into the same cluster as "father" in that subspace.
The data demonstrates that the underlying model has grouped multilingual synonyms and related concepts together. The prominent family-term clusters suggest the model has learned cross-lingual semantic relationships for core vocabulary. The inclusion of subword tokens (`##...`) and a mix of thematic words indicates the clusters are based on distributional semantics—how words are used in context across a large, multilingual text corpus—rather than a simple dictionary lookup.
The variation in cluster size (token count) shows that some semantic concepts (like Segment 1's cluster) are more densely populated in the embedding space or the source data than others (like Segment 16's cluster). The presence of seemingly out-of-place words within a family-themed cluster (e.g., "wheat" in Segment 12) could be an artifact of the clustering algorithm, a reflection of metaphorical usage in the training data, or an indicator of a broader, more abstract conceptual grouping that isn't immediately obvious.
</details>
Figure 2: Grouping of the token "father" at a few selected subspaces