# Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics
**Authors**:
- Kavin R V (Indian Institute of Technology Kharagpur, WB, India)
- Pawan Goyal (Indian Institute of Technology Kharagpur, WB, India)
Abstract
Standard language models employ unique, monolithic embeddings for each token, potentially limiting their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be more effectively represented through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA), as well as a biomedical domain-specific benchmark (BC5CDR) using BioBERT. Our findings demonstrate that representing tokens compositionally via ASG achieves extreme compression of embedding parameters (down to 0.4–0.5% of the original size) while maintaining $>$ 95% of the base model's task performance, even in generative tasks, and that this extends to both cross-lingual transfer and domain-specific settings. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a simple yet concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling compact yet semantically rich models.
Contact: Kavin R V (kavinrv13@gmail.com), Pawan Goyal (pawang@cse.iitkgp.ac.in)
1 Introduction
In modern language models, each token is typically represented by an individual, unique embedding. However, this approach may not be optimal, as semantically similar tokens (e.g., "mother," "mom," and their respective translations in different languages) can be assigned entirely distinct representations, potentially overlooking shared conceptual underpinnings. Recent work (Park et al., 2023, 2024; Shani et al., 2025) suggests that token representations in LLMs implicitly encode higher-level semantic regularities, often described as concepts, which may be shared across words or subwords. While these studies analyze such concepts as emergent semantic categories or directions in representation space, our work explores an explicit, compositional formulation where tokens are represented as sequences of shared Concept Vectors. In parallel, Zhang et al. (2024) proposed concept-level representations that group semantically similar tokens using k-means. While this method achieved significant vocabulary compression with retained performance, it struggles with polysemy (e.g., "father" as family vs. religious figure) and is limited to encoder-only models, because it does not explicitly predict subwords during autoregressive decoding.
<details>
<summary>images/main.png Details</summary>

### Visual Description
## Diagram: Word Embedding Transformation
### Overview
This diagram illustrates a process of transforming word embeddings into new embeddings (concept vectors) through a series of segments and concept groupings. It shows how individual word embeddings (W1-W10) are processed and potentially combined to form new representations. The diagram highlights the transformation of a specific token, w6, from its old embedding to a new embedding.
### Components/Axes
The diagram consists of three main sections:
1. **Word Embeddings:** A grid representing the initial word embeddings. The rows are labeled W1 through W10, and the columns are divided into three segments: Segment 1, Segment 2, and Segment 3.
2. **New Embedding (Concept Vectors):** Three circular groupings of colored areas (c1-c12) representing the new embeddings or concept vectors. Each grouping is visually distinct and connected by arrows indicating transformation.
3. **Token w6 Transformation:** A vertical dashed line separating the old and new input embeddings for token w6. The old embedding is represented by a single vector, while the new embedding is composed of three vectors: c2, c8, and c10.
### Detailed Analysis or Content Details
**Word Embeddings Section:**
* The grid is 10 rows x 3 columns.
* Segment 1 (leftmost column) is predominantly blue, with some red in rows W2 and W3.
* Segment 2 (middle column) is predominantly red, with some blue in rows W1 and W4.
* Segment 3 (rightmost column) is predominantly green, with some red in rows W1, W2, and W3, and some blue in rows W5 and W6.
* The color intensity varies within each segment, suggesting different values within the embeddings.
**New Embedding (Concept Vectors) Section:**
* **Grouping 1 (Top):** Contains concepts c1, c2, c3, and c4. c1 and c4 are blue, c2 and c3 are light blue.
* **Grouping 2 (Middle):** Contains concepts c5, c6, c7, and c8. c5 and c7 are light blue, c6 and c8 are blue.
* **Grouping 3 (Bottom):** Contains concepts c9, c10, c11, and c12. c9 and c11 are light blue, c10 and c12 are blue.
* Arrows connect the Word Embeddings section to these groupings, indicating a transformation process. The arrows are not labeled with specific weights or functions.
**Token w6 Transformation Section:**
* The "Old Input embedding for token w6" is a single vector.
* The "New Input embedding for token w6" is composed of three vectors: c2, c8, and c10.
* The dashed line visually separates the old and new embeddings.
### Key Observations
* The transformation process appears to decompose a single word embedding (w6) into a combination of concept vectors (c2, c8, c10).
* The color scheme suggests that similar concepts are grouped together.
* The segments in the Word Embeddings section may represent different aspects or features of the words.
* The diagram does not provide numerical values for the embeddings, only visual representations.
### Interpretation
The diagram illustrates a concept of distributed representation learning, where words are represented as vectors in a high-dimensional space. The transformation process suggests a way to refine or decompose these embeddings into more meaningful concept vectors. The segmentation of the initial word embeddings implies that the words are being analyzed based on different features or contexts. The decomposition of w6 into c2, c8, and c10 suggests that the meaning of w6 can be expressed as a combination of these concepts.
The diagram doesn't provide specific details about the transformation function or the criteria for grouping concepts. It's a high-level illustration of a potential process for creating more nuanced and context-aware word representations. The use of color coding is a visual aid to understand the relationships between different concepts and segments. The diagram is conceptual and does not contain quantifiable data. It is a visual representation of a process, not a presentation of results.
</details>
Figure 1: Overview of the Aggregate Semantic Grouping (ASG) method for creating compositional token embeddings. Product Quantization is applied to the original word embedding layer. Embeddings are segmented into $m$ sub-vectors. For each of the $m$ segment positions, k-means clustering is performed on the corresponding sub-vectors from all tokens to learn a codebook of $k$ Concept Vectors (centroids). The new ASG embedding layer, which contains these learned Concept Vectors, is initialized as the model's embedding layer. Instead of using the original input embedding for a token ‘w’, a sequence of $m$ ConceptIDs is used to retrieve the respective Concept Vectors from the ASG layer; these are then concatenated to form the new representation for token ‘w’.
To address these limitations, we introduce Aggregate Semantic Grouping (ASG). ASG maintains concept-level sharing but represents tokens as sequences of ‘ConceptIDs’, thereby accumulating multiple semantic facets. This sequence-based representation is inspired by successful applications in information and generative retrieval (Wang et al., 2022; Tay et al., 2022; Zhou et al., 2022). We employ Product Quantization (PQ) (Jégou et al., 2011) to transform tokens into these ConceptID sequences, aiming to preserve each token’s uniqueness and nuances while benefiting from shared semantics.
Our primary contribution is the introduction of Aggregate Semantic Grouping (ASG), a novel method leveraging Product Quantization to represent tokens as sequences of shared ConceptIDs, thereby capturing multiple semantic facets while significantly compressing embedding layer parameters. We provide a detailed methodology for applying ASG to both encoder and encoder–decoder transformer models. Conducting experiments across diverse tasks (NLI, NER, QA) and models (mBERT, XLM-R, mT5), we demonstrate that even with extreme compression on embeddings (down to 0.4–0.5% of the original embedding parameters), ASG maintains high performance (often $>$ 95% relative to baseline) and outperforms the prior semantic grouping method (Zhang et al., 2024), including in zero-shot cross-lingual transfer scenarios. Furthermore, we extend our evaluation to a domain-specific benchmark (BC5CDR; Li et al., 2016) and a domain-specialized model (BioBERT; Lee et al., 2020), where ASG achieves similar robustness, confirming its applicability beyond general-domain tasks. The code will be available at https://github.com/KavinRV/Aggregate-Semantic-Grouping.
2 Aggregate Semantic Grouping (ASG)
Our approach, Aggregate Semantic Grouping, reframes token representation by learning compositional embeddings from pre-trained models.
2.1 Learning Concept Vectors via Product Quantization
We begin with a pre-trained word embedding matrix $E$ , where each row is a $D$ -dimensional vector for a token in a vocabulary of size $V$ . Using Product Quantization (PQ), each $D$ -dimensional embedding is first divided into $m$ distinct segments (sub-vectors), each of dimension $D/m$ . For each of these $m$ segment positions, we apply k-means clustering to the collection of all corresponding segments from every token in the vocabulary. This process yields $m$ distinct codebooks; each codebook $C_{i}$ (for $i=0,...,m-1$ ) contains $k$ centroids, termed Concept Vectors, specific to that segment position. Each Concept Vector is of dimension $D/m$ .
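The codebook-learning step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `kmeans` and `learn_codebooks`, the plain Lloyd's k-means, and the toy sizes are all hypothetical stand-ins for a real PQ library.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)

def learn_codebooks(E, m, k):
    """Split each row of E (V x D) into m segments and cluster each segment position."""
    V, D = E.shape
    assert D % m == 0, "D must be divisible by m"
    segments = E.reshape(V, m, D // m)
    codebooks = np.empty((m, k, D // m), dtype=E.dtype)  # C_0 .. C_{m-1}
    codes = np.empty((V, m), dtype=np.int64)             # centroid index per segment
    for i in range(m):
        codebooks[i], codes[:, i] = kmeans(segments[:, i, :], k, seed=i)
    return codebooks, codes

# Toy sizes (hypothetical), far smaller than a real vocabulary.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 16)).astype(np.float32)
codebooks, codes = learn_codebooks(E, m=4, k=8)
print(codebooks.shape, codes.shape)  # (4, 8, 4) (200, 4)
```

Each `codebooks[i]` plays the role of $C_{i}$, and `codes[t, i]` is the index of the centroid chosen for segment $i$ of token $t$.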
2.2 ASG Embedding Layer Initialization
The $m$ distinct codebooks ( $C_{0},C_{1},...,C_{m-1}$ ), where each codebook $C_{i}$ contains $k$ Concept Vectors of dimension $D/m$ , are concatenated to form a single, new embedding matrix $E^{\prime}$ . This matrix $E^{\prime}$ has dimensions $(m× k)×(D/m)$ and stores all unique Concept Vectors. Specifically, the $j$ -th Concept Vector (where $j∈[0,k-1]$ ) from the $i$ -th codebook $C_{i}$ is located at row $i× k+j$ within $E^{\prime}$ .
Each token is then mapped to a sequence of $m$ ConceptIDs. For each of its $m$ embedding segments, the corresponding ConceptID is the specific row index in $E^{\prime}$ that stores the chosen Concept Vector for that segment. This row index is determined as $i× k+s_{i}$ , where $i$ is the segment index (from $0$ to $m-1$ ) and $s_{i}$ is the index (from $0$ to $k-1$ ) of the selected centroid from the $i$ -th segment’s codebook. This sequence of $m$ row indices (ConceptIDs) thus identifies the set of Concept Vectors representing the token.
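The row layout of $E^{\prime}$ and the ConceptID offset $i× k+s_{i}$ can be checked on a toy example. The sketch below assumes random stand-in codebooks and hand-picked centroid indices; all sizes and variable names are hypothetical.

```python
import numpy as np

m, k, seg_dim = 3, 4, 2                       # hypothetical toy sizes; D = m * seg_dim = 6
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, seg_dim))  # stand-in for learned C_0 .. C_{m-1}
s = np.array([[1, 3, 0],                      # per-segment centroid indices s_i
              [2, 2, 3]])                     # for two example tokens

# Stack the codebooks so that centroid j of codebook i sits at row i*k + j.
E_prime = codebooks.reshape(m * k, seg_dim)

# The ConceptID for segment i is the row index i*k + s_i.
concept_ids = np.arange(m) * k + s
print(concept_ids.tolist())                   # [[1, 7, 8], [2, 6, 11]]
```

For the first token, segment 1 selects centroid 3 of codebook $C_{1}$, which lives at row $1× 4+3=7$ of $E^{\prime}$.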
Table 1: Evaluation results across cluster granularities for mBERT and XLM-R on multilingual benchmarks. Scores include F1, Accuracy, and relative performance (%Base). For XNLI, %Base is the accuracy relative to the base model. 40% SG: Semantic Grouping as described in Zhang et al. (2024). In the zero-shot setting, the models were trained on the English dataset and tested on all languages.
| Model | Embedding params (%) | Model params (%) | XNLI | | %Base | WikiANN F1 | %Base | XNLI (Zero-Shot) | %Base | WikiANN (Zero-Shot) F1 | %Base |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 100.00 | 100.00 | 75.46 | 74.79 | 100.00 | 89.74 | 100.00 | 64.86 | 100.00 | 58.58 | 100.00 |
| -40% SG | 40.00 | 68.95 | 72.43 | 71.88 | 95.99 | 86.69 | 96.61 | 60.64 | 93.49 | 52.35 | 89.37 |
| -ASG( $k$ =512, $m$ =48) | 0.50 | 48.65 | 73.51 | 72.84 | 97.42 | 88.11 | 98.19 | 61.30 | 94.51 | 55.71 | 95.10 |
| XLM-R | 100.00 | 100.00 | 77.98 | 77.28 | 100.00 | 88.37 | 100.00 | 71.94 | 100.00 | 58.74 | 100.00 |
| -40% SG | 40.00 | 58.48 | 74.56 | 73.96 | 95.61 | 84.57 | 95.70 | 65.83 | 91.51 | 51.48 | 87.65 |
| -ASG( $k$ =1024, $m$ =48) | 0.40 | 31.08 | 77.06 | 76.39 | 98.81 | 86.53 | 97.92 | 68.05 | 67.39 | 54.46 | 92.72 |
2.3 Token Representation with ASG
When a token is processed, its pre-computed sequence of $m$ ConceptIDs is used to retrieve the corresponding $m$ Concept Vectors from their respective codebooks within $E^{\prime}$ . Let these retrieved Concept Vectors be $v_{0},v_{1},...,v_{m-1}$ , where each $v_{i}$ has dimension $D/m$ . The final ASG representation for the token, $e^{\prime}∈\mathbb{R}^{D}$ , is obtained by concatenating these $m$ Concept Vectors:
$$
e^{\prime}=\text{concat}(v_{0},v_{1},\dots,v_{m-1}) \tag{1}
$$
This vector $e^{\prime}$ serves as the input to subsequent layers of the model.
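As a minimal sketch of Eq. (1), the lookup-and-concatenate step amounts to a gather over $E^{\prime}$ followed by a reshape. The function name `asg_embed` and the toy sizes are assumptions made for illustration only.

```python
import numpy as np

def asg_embed(token_ids, concept_ids, E_prime):
    """Return concat(v_0, ..., v_{m-1}) (Eq. 1) for each token id."""
    ids = concept_ids[token_ids]                # (..., m) row indices into E'
    vecs = E_prime[ids]                         # (..., m, D/m) Concept Vectors
    return vecs.reshape(*ids.shape[:-1], -1)    # (..., D) concatenated segments

# Hypothetical toy setup: V=5 tokens, m=3 segments, k=4 centroids, D/m=2.
V, m, k, seg_dim = 5, 3, 4, 2
rng = np.random.default_rng(0)
E_prime = rng.normal(size=(m * k, seg_dim))
s = rng.integers(0, k, size=(V, m))             # per-segment centroid indices
concept_ids = np.arange(m) * k + s              # offsets select the right codebook

batch = np.array([[0, 4], [2, 2]])              # a (2, 2) batch of token ids
out = asg_embed(batch, concept_ids, E_prime)
print(out.shape)                                # (2, 2, 6)
```

Because the ConceptID table is fixed, only the $m× k$ rows of $E^{\prime}$ would need to be trainable in such a layer.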
2.4 Application to Generative Models
For models with a decoder, which have separate input and output embedding layers (the latter often serving as token classifier weights), we apply the ASG process to both. This results in two distinct ASG embedding structures: one for input token representations ( $E^{\prime}$ ) and another for the output layer ( $OE^{\prime}$ ), each derived from its respective original embedding matrix.
Table 2: Evaluation results for generative models across cluster granularities for mT5 on TyDiQA and WikiANN. Separate: one codebook per segment; Shared: a single codebook shared across all segments. In the zero-shot setting, the models were trained on the English dataset and tested on all languages. Parameter columns give the size after reduction as a percentage of the original.
| Model | Configuration | Embedding params (%) | Model params (%) | TyDiQA F1 | TyDiQA EM | TyDiQA %Base | WikiANN F1 | WikiANN %Base | WikiANN (Zero-Shot) F1 | WikiANN (Zero-Shot) %Base |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mT5 | | 100.00 | 100.00 | 70.74 | 56.20 | 100.00 | 84.21 | 100.00 | 50.75 | 100.00 |
| ASG Separate | $k=1024,m=32$ | 0.45 | 15.06 | 60.67 | 46.15 | 85.76 | 79.85 | 94.82 | 25.84 | 50.91 |
| ASG Separate | $k=2048,m=32$ | 0.85 | 15.41 | 63.81 | 49.06 | 90.19 | 80.93 | 96.11 | 29.85 | 58.82 |
| ASG Separate | $k=8192,m=32$ | 3.32 | 17.51 | 66.22 | 51.71 | 93.61 | 82.19 | 97.60 | 33.51 | 66.03 |
| ASG Separate | $k=1024,m=64$ | 0.45 | 15.06 | 69.96 | 55.53 | 98.89 | 83.18 | 98.78 | 44.02 | 86.74 |
| ASG Shared | $k=16384,m=32$ | 0.25 | 14.89 | 66.50 | 51.90 | 93.99 | 81.65 | 96.96 | 34.01 | 67.02 |
| ASG Shared | $k=32768,m=32$ | 0.45 | 15.06 | 67.00 | 53.06 | 94.71 | 82.04 | 97.42 | 37.01 | 72.92 |
| ASG Shared | $k=32768,m=64$ | 0.25 | 14.89 | 70.81 | 56.51 | 100.09 | 84.19 | 99.97 | 47.23 | 93.06 |
Output Logit Calculation: To compute the logit $l_{t}$ for a target token $t$ , the final hidden state $H∈\mathbb{R}^{D}$ from the model is first segmented into $m$ parts: $H=[H_{0},H_{1},...,H_{m-1}]$ , where each $H_{i}∈\mathbb{R}^{D/m}$ . Let the sequence of Concept Vectors for token $t$ be $u_{t,0},u_{t,1},...,u_{t,m-1}∈ OE^{\prime}$ . The logit is calculated as:
$$
l_{t}=\sum_{i=0}^{m-1}H_{i}\cdot u_{t,i} \tag{2}
$$
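Eq. (2) can be sketched as follows. The version below scores every Concept Vector once per segment and then gathers the $m$ partial scores for each token; that factorization is an illustrative choice rather than a confirmed detail of the authors' code, and the function name `asg_logits` and toy sizes are hypothetical. It assumes the separate-codebook layout where codebook $i$ occupies rows $i× k$ to $i× k+k-1$ of $OE^{\prime}$.

```python
import numpy as np

def asg_logits(H, OE_prime, out_concept_ids, m, k):
    """Logits for all V tokens from one hidden state H of size D (Eq. 2)."""
    seg_dim = H.shape[0] // m
    H_seg = H.reshape(m, seg_dim)                    # H_0 .. H_{m-1}
    books = OE_prime.reshape(m, k, seg_dim)          # codebook i = rows i*k .. i*k+k-1
    # scores[i, j] = H_i . (centroid j of codebook i), computed once for all tokens.
    scores = np.einsum("id,ikd->ik", H_seg, books)   # (m, k)
    local = out_concept_ids - np.arange(m) * k       # back to per-codebook indices s_i
    return scores[np.arange(m), local].sum(axis=-1)  # (V,)

# Hypothetical toy setup.
V, m, k, seg_dim = 7, 3, 4, 2
D = m * seg_dim
rng = np.random.default_rng(0)
OE_prime = rng.normal(size=(m * k, seg_dim))
out_concept_ids = np.arange(m) * k + rng.integers(0, k, size=(V, m))
H = rng.normal(size=D)

logits = asg_logits(H, OE_prime, out_concept_ids, m, k)
# Reference: rebuild each token's full output vector and take a plain dot product.
naive = OE_prime[out_concept_ids].reshape(V, D) @ H
print(np.allclose(logits, naive))                    # True
```

The comparison with `naive` reflects that the sum of segment-wise dot products equals the dot product between $H$ and the concatenated Concept Vectors of token $t$.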
3 Experiments and Results
3.1 Datasets
We evaluate our proposed ASG method on diverse cross-lingual benchmarks for natural language inference (NLI), question answering (QA), and named entity recognition (NER). These include: XNLI (Conneau et al., 2018), a 15-language sentence understanding benchmark; the Gold Passage (GoldP) task of TyDi QA (Clark et al., 2020), an 11-language QA dataset where gold context is provided; and the XTREME benchmark version (Hu et al., 2020) of WikiANN (Pan et al., 2017), a 40-language NER dataset.
3.2 Settings
For $k$ , values were generally chosen as powers of two. This allowed us to systematically target specific levels of embedding parameter compression, aiming for reductions that brought the ASG embedding layer size to approximately 0.5%, 1%, and 4% of the original embedding parameters. Regarding the number of subspaces $m$ , our explorations indicated that too few subspaces (e.g., $m=16$ ) resulted in a significant degradation of model performance. Conversely, very high values of $m$ (e.g., 128, 256, or 512) would lead to extremely small dimensions for each segment ( $D/m$ , potentially as low as 4, 2, or 1 for common embedding sizes $D$ ) and would consequently require very long sequences of ConceptIDs (length $m$ ) to represent each token. These considerations led us to focus on $m$ values within a moderate range for the experiments detailed below.
3.3 Fine-tuning Performance
We evaluated ASG on the encoder-only mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2019) models using the XNLI and WikiANN datasets, mainly to compare its effectiveness against the Semantic Grouping (SG) method of Zhang et al. (2024). As Table 1 shows, ASG achieves significant embedding compression while maintaining over 97% of baseline performance, and it notably outperforms the SG method even with a low $k$ value.
To assess ASG for generative tasks, we then evaluated the mT5 model (Xue et al., 2020) on the TyDiQA and WikiANN datasets, applying ASG to both its input and output embeddings. Table 2, detailing results for various cluster ( $k$ ) and subspace ( $m$ ) configurations, shows ASG consistently achieved over 85% of baseline mT5 performance. Specifically, with $k≥ 2048$ , relative performance on TyDiQA surpassed 90%, while on WikiANN, ASG configurations generally exceeded 95% of the baseline.
Furthermore, for mT5, we investigated a variant employing a single shared codebook across all $m$ subspaces. To achieve this, the $m$ segments from all token embeddings in the vocabulary are pooled together before applying k-means clustering, which yields one global codebook of Concept Vectors. Each of the $m$ ConceptIDs for a token then selects a Concept Vector from this shared codebook to represent its corresponding segment. Despite reducing the diversity of available Concept Vectors, this approach maintained roughly 94–100% relative performance on TyDiQA and 97–100% on WikiANN (Table 2). This suggests that a highly restricted set of output Concept Vectors can still be effective for generative tasks.
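A minimal sketch of the shared-codebook variant, assuming a plain Lloyd's k-means and toy sizes (the `kmeans` helper and all variable names are hypothetical): the only change from the per-segment setup is that segments from every position are pooled before clustering, so ConceptIDs need no per-codebook offset.

```python
import numpy as np

def kmeans(x, k, n_iter=15, seed=0):
    """Compact Lloyd's k-means (illustrative helper); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        a = ((x[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(a == j):          # skip empty clusters
                c[j] = x[a == j].mean(0)
    return c, ((x[:, None] - c[None]) ** 2).sum(-1).argmin(1)

V, D, m, k = 300, 16, 4, 32             # hypothetical toy sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))

# Pool the segments from all m positions of all V tokens into one set.
pooled = E.reshape(V * m, D // m)
shared_codebook, flat_codes = kmeans(pooled, k)

# Every position indexes the same table, so IDs lie in [0, k) with no i*k offset.
concept_ids = flat_codes.reshape(V, m)
recon = shared_codebook[concept_ids].reshape(V, D)
print(shared_codebook.shape, concept_ids.shape, recon.shape)  # (32, 4) (300, 4) (300, 16)
```

With a shared table, the embedding matrix has $k$ rows instead of $m× k$, which is why much larger $k$ values fit in the same parameter budget in Table 2.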
3.4 Domain-Specific Evaluation (BC5CDR)
| Model | Embedding params (%) | Model params (%) | F1 | %Base |
| --- | --- | --- | --- | --- |
| BERT-base | 100 | 100.00 | 85.58 | 100.00 |
| -40% SG | 40 | 87.64 | 81.57 | 95.32 |
| -ASG ( $k{=}128$ , $m{=}48$ ) | 0.81 | 79.50 | 83.90 | 98.03 |
| -ASG ( $k{=}512$ , $m{=}48$ ) | 2.13 | 79.77 | 85.24 | 99.60 |
Table 3: BERT-base fine-tuned on BC5CDR.
To further validate ASG in specialized settings, we evaluate on the BC5CDR Named Entity Recognition task (Li et al., 2016), a biomedical benchmark focused on identifying chemical and disease entities. This task poses strong vocabulary-specific requirements, making it a challenging testbed for compressed embeddings.
We compare BioBERT (Lee et al., 2020) and BERT-base with standard embeddings, Semantic Grouping (SG), and ASG under varying compression levels. Results are reported as F1 score and relative performance (%Base).
| Model | Embedding params (%) | Model params (%) | F1 | %Base |
| --- | --- | --- | --- | --- |
| BioBERT | 100 | 100.00 | 89.48 | 100.00 |
| -40% SG | 40 | 87.64 | 86.71 | 96.90 |
| -ASG( $k{=}128$ , $m{=}48$ ) | 0.81 | 79.50 | 87.78 | 98.10 |
| -ASG( $k{=}512$ , $m{=}48$ ) | 2.13 | 79.77 | 88.93 | 99.39 |
Table 4: BioBERT fine-tuned on BC5CDR.
Across both backbones, ASG preserves high task performance under strong embedding compression. Even at $<1\%$ of the original embedding size, ASG recovers over 98% of the base model performance. These results demonstrate that ASG generalizes beyond general-domain benchmarks to biomedical NER.
3.5 Cross-Lingual Transfer (Zero-Shot)
For zero-shot cross-lingual transfer, we followed the experimental setup of Zhang et al. (2024). Models were trained solely on the English XNLI and WikiANN training sets and then evaluated on the multilingual test sets of these datasets. In this setting, ASG-enhanced models outperformed the Semantic Grouping method. While generative models using ASG with lower $k$ (clusters per segment) and $m$ (segments) values showed reduced performance in cross-lingual transfer (Table 2), configurations with $m=64$ segments nonetheless achieved at least 86% of baseline model performance. Using a shared codebook, performance improved further, up to 93% of the baseline model, with just 0.25% of the embedding parameters.
3.6 Qualitative Analysis
Figure 2 illustrates how Aggregate Semantic Grouping (ASG) captures varied semantic facets of the token "father" through its clustering across selected segments:
- Familial Context: "father" clusters with kinship terms such as "padre" (father), "mother", and "daughter" (Segment 2), or "barn" (child), "parent", and "grandmother" (Segment 12; also Segment 16), reflecting its primary familial sense.
- Authority/Religious Context: In Segment 0, "father" groups with "Chief", "Prophet", "notables", and "religión", indicating connotations of leadership or religious reverence.
- Figurative/Abstract Contexts: Other segments link "father" to broader concepts, such as "Zeus" (mythological father figure), or with terms like "records", "govern", and "legacy" (Segment 7), potentially reflecting historical origin, or the act of establishing something significant.
4 Conclusion
This work investigated equipping language models with shared, compositional token representations as an alternative to traditional monolithic embeddings. We explored this through Aggregate Semantic Grouping (ASG), where Product Quantization transforms embeddings into sequences of ConceptIDs that map to shared, learned Concept Vectors, enabling multifaceted semantic capture alongside significant compression. Extensive experiments on diverse models (including mBERT, XLM-R, and mT5) and NLU tasks (such as NLI, NER, and QA) found ASG maintains high performance (often >95% relative to baseline) despite extreme parameter reduction (to <1% of original size). ASG also outperformed prior semantic grouping methods, and proved effective for generative architectures. These findings confirm that ASG’s decomposition of tokens into shared components offers an efficient, semantically rich, and promising direction for language modeling; future work may explore dynamic or adaptive quantization techniques.
5 Limitations
ASG was applied directly to word embeddings from pre-trained models without an explicit cross-lingual alignment step, which could refine Concept Vector clustering. This may partly explain the observed performance degradation in generative tasks within cross-lingual settings, such as on WikiANN, where better nuance preservation through alignment-optimized clustering could be beneficial. Furthermore, we did not undertake continual pre-training of the models with the ASG embeddings; such a phase could allow models to adapt more effectively to the compositional representations and potentially enhance overall performance. Complementary to this, methods similar to GraphMerge (Wu and Monz, 2023) could potentially be combined with ASG to pre-align embeddings before clustering, leading to more coherent ConceptID assignments.
References
- Clark et al. (2020) Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
- Conneau et al. (2019) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186.
- Hu et al. (2020) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International conference on machine learning, pages 4411–4421. PMLR.
- Jégou et al. (2011) Herve Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.
- Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
- Li et al. (2016) Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. Biocreative V CDR task corpus: a resource for chemical disease relation extraction. Database J. Biol. Databases Curation, 2016.
- Pan et al. (2017) Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
- Park et al. (2024) Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. 2024. The geometry of categorical and hierarchical concepts in large language models. arXiv preprint arXiv:2406.01506.
- Park et al. (2023) Kiho Park, Yo Joong Choe, and Victor Veitch. 2023. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.
- Shani et al. (2025) Chen Shani, Dan Jurafsky, Yann LeCun, and Ravid Shwartz-Ziv. 2025. From tokens to thoughts: How llms and humans trade compression for meaning. arXiv preprint arXiv:2505.17117.
- Tay et al. (2022) Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, and 1 others. 2022. Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems, 35:21831–21843.
- Wang et al. (2022) Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, and 1 others. 2022. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems, 35:25600–25614.
- Wu and Monz (2023) Di Wu and Christof Monz. 2023. Beyond shared vocabulary: Increasing representational word similarities across languages for multilingual machine translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9749–9764, Singapore. Association for Computational Linguistics.
- Xue et al. (2020) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
- Zhang et al. (2024) Xinyu Zhang, Jing Lu, Vinh Q Tran, Tal Schuster, Donald Metzler, and Jimmy Lin. 2024. Tomato, tomahto, tomate: Measuring the role of shared semantics among subwords in multilingual language models. arXiv preprint arXiv:2411.04530.
- Zhou et al. (2022) Yujia Zhou, Jing Yao, Zhicheng Dou, Ledell Wu, Peitian Zhang, and Ji-Rong Wen. 2022. Ultron: An ultimate retriever on corpus with a model-based indexer. arXiv preprint arXiv:2208.09257.
Table 5: Model configurations and embedding parameters for ASG. Rows marked "(shared)" use a shared codebook.
| Model | k | m | Parameters | Embedding Shape (Dim) |
| --- | --- | --- | --- | --- |
| mBERT | N/A | N/A | 177M | [120k, 768] |
| mBERT + ASG | 512 | 48 | 86M | [30k, 16] |
| XLM-R | N/A | N/A | 277M | [250k, 768] |
| XLM-R + ASG | 1024 | 48 | 86M | [49k, 16] |
| mT5 | N/A | N/A | 300M | [256k, 512] |
| mT5 + ASG | 1024 | 32 | 45M | [36k, 16] |
| mT5 + ASG | 2048 | 32 | 46M | [68k, 16] |
| mT5 + ASG | 8192 | 32 | 53M | [265k, 16] |
| mT5 + ASG (shared) | 16384 | 32 | 44M | [20k, 16] |
| mT5 + ASG (shared) | 32768 | 32 | 45M | [36k, 16] |
| mT5 + ASG (shared) | 32768 | 64 | 44M | [39k, 8] |
| mT5 + ASG | 1024 | 64 | 45M | [72k, 8] |
Appendix A Experimental Setup
All our experiments are conducted using the smallest available checkpoint for each respective pre-trained model. Training is performed with a batch size of 128, and all experiments were run on a single Nvidia L40 GPU.
For the encoder models (mBERT and XLM-R), we set a weight decay of $0.01$ . The learning rate was $5× 10^{-6}$ for XNLI experiments and $5× 10^{-5}$ for WikiANN experiments. These models were trained for 2 epochs; for cross-lingual transfer settings, training was extended to 5 epochs. The mT5 model was trained with a learning rate of $1× 10^{-3}$ .
Product Quantization is implemented using the nanopq library (https://github.com/matsui528/nanopq). For the k-means clustering within nanopq, we use the faiss library (https://faiss.ai/).
Appendix B Parameter Reduction Calculation
In a standard model, the token embedding table has shape $(V,D)$ , where $V$ is the vocabulary size and $D$ is the embedding dimension.
With ASG, the embedding layer is replaced by $m$ codebooks, each with $k$ Concept Vectors of size $D/m$ . The ASG embedding matrix thus has shape:
$$
(k\cdot m,\tfrac{D}{m})
$$
The number of parameters becomes:
$$
k\cdot m\cdot\tfrac{D}{m}=k\cdot D
$$
So the ratio of ASG to original embedding parameters is:
$$
\frac{k\cdot D}{V\cdot D}=\frac{k}{V}
$$
For example, in XLM-R with $V=250{,}000$ , $D=768$ , $k=1024$ , and $m=48$ , the ASG matrix has approximately $49{,}000$ rows ( $k\cdot m=49{,}152$ ) of dimension $D/m=16$ , and the compression ratio is:
$$
\frac{49{,}000\cdot 16}{250{,}000\cdot 768}\approx 0.0040\,(=0.4\%).
$$
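The arithmetic above can be reproduced with a short helper. This is only a worked check of the formula $k/V$ using the exact row count $k\cdot m$; the function name is hypothetical and $V=250{,}000$ is the rounded vocabulary size used in the example.

```python
def asg_embedding_ratio(V, D, k, m):
    """Shape of the ASG matrix and its size relative to the (V, D) embedding table."""
    rows, cols = k * m, D // m
    return rows, cols, (rows * cols) / (V * D)

rows, cols, ratio = asg_embedding_ratio(V=250_000, D=768, k=1024, m=48)
print(rows, cols, f"{ratio:.4%}")   # 49152 16 0.4096%
```

Since $m$ cancels in $k\cdot m\cdot D/m$, the ratio depends only on $k$ and $V$, so changing $m$ trades segment dimension against sequence length without affecting the parameter count.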
Appendix C Token-to-ConceptID Mapping Matrix.
Each token is mapped to a sequence of $m$ ConceptIDs (integers in $[0,k)$ ), forming a matrix of shape $(V,m)$ . This mapping is:
- Not a parameter: it is precomputed, fixed, and non-trainable.
- Stored externally and could be used during tokenization.
- Extremely compact, as it contains only small integers.
For $V=250$ k and $m=48$ , this mapping can be stored with an overhead of $\sim 2\%$ ( $15$ MB) or $\sim 2.15\%$ ( $16.5$ MB) of the memory required for the original embedding matrix, when $k=1024$ or $k=2048$ respectively.
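The overhead figures can be reproduced under two assumptions that the text does not state explicitly: each ConceptID is bit-packed using $\lceil\log_{2}k\rceil$ bits, and the original embedding matrix is stored as 32-bit floats with $D=768$. The helper name below is hypothetical.

```python
def mapping_overhead(V, m, k, D, bytes_per_weight=4):
    """Size of the token-to-ConceptID table (MB) and its ratio to the original embeddings."""
    bits_per_id = (k - 1).bit_length()          # ceil(log2(k)) for k >= 2
    mapping_bytes = V * m * bits_per_id / 8     # assumes bit-packed storage
    embedding_bytes = V * D * bytes_per_weight  # assumes float32 weights
    return mapping_bytes / 1e6, mapping_bytes / embedding_bytes

for k in (1024, 2048):
    mb, frac = mapping_overhead(V=250_000, m=48, k=k, D=768)
    print(k, mb, f"{frac:.2%}")
# 1024 15.0 1.95%
# 2048 16.5 2.15%
```

Under these assumptions the results agree with the $15$ MB and $16.5$ MB figures quoted above; storing IDs as 16-bit integers instead would raise the table to 24 MB for either value of $k$.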
Appendix D ASG Configuration and Embedding Parameters
Table 5 provides a summary of the configurations used for Aggregate Semantic Grouping (ASG) across different models, alongside details for the original base models. The table specifies the choices for $k$ (number of centroids per subspace) and $m$ (number of subspaces) for each ASG setup. It also lists the resulting total number of parameters in the model and the shape of the embedding layer.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Data Table: Word Clusters
### Overview
The image presents a data table displaying word clusters identified from a text corpus. The table has three columns: "Segment (m=48)", "Number of tokens in Cluster", and "Words in Cluster". Each row represents a different cluster of words, along with the number of tokens (words) in that cluster and the words themselves. The data appears to be related to familial terms and associated concepts.
### Components/Axes
* **Segment (m=48):** An integer identifier for each cluster, ranging from 0 to 16. The "(m=48)" likely indicates the total number of segments or the size of the model used for clustering.
* **Number of tokens in Cluster:** An integer representing the count of words within each cluster. Values range from 13 to 29.
* **Words in Cluster:** A string containing a comma-separated list of words associated with the cluster. These words include both English and non-English terms, often with special characters or hashtags.
### Detailed Analysis or Content Details
Here's a reconstruction of the data table content, cluster by cluster:
* **Segment 0:** Number of tokens: 27. Words: 'father', 'parts', 'Chief', 'chief', 'Ho', 'abinadons', 'Mother', 'old', '#old', 'Telegraph', 'earth', '#زمین', 'notables', 'adta', 'religion', 'wetran', 'Israel', 'praise', 'Prophet', 'שיר', 'Thief', 'voter', 'Capitaine'.
* **Segment 1:** Number of tokens: 29. Words: 'father', 'piccolo', 'Zeus', 'Castello', 'центр', '#рука', 'senjata', '#or', 'antichi', 'iniziale', 'صمم', 'Verenigd', 'apariencia'.
* **Segment 2:** Number of tokens: 22. Words: 'father', 'padre', 'daughter', 'mother', 'сын', 'd', 'incident', 'baby', 'daughters', 'V6', 'grandson', 'पराक्रम', 'afromoths', 'Michaela', 'тпэло', 'семейство', 'Mana', 'iluz', '호', 'Potok', 'смт', '#田'.
* **Segment 7:** Number of tokens: 19. Words: 'father', 'records', 'govern', '#fine', 'corona', 'Hartmann', '#icka', 'Uheuquqwh', 'Dakar', 'Pons', 'legacy', '#oxi', 'nortpert', 'marge', 'naturally', 'अतिसर्रानीय', 'McDowell'.
* **Segment 12:** Number of tokens: 24. Words: '#', 'father', 'ef', 'barn', '#', 'infantry', 'Elementary', 'kinderen', 'coupe', 'parent', '#BCA', 'injuries', '#RC', '#GC', 'connect', 'cache', 'علق', 'kaldi', 'Lahl', 'indicato', 'seco', 'grandmother', 'wheat'.
* **Segment 16:** Number of tokens: 13. Words: 'father', 'début', 'mother', 'port', 'склад', 'Daí', 'Beginning', 'uncle', '#дним', 'Strazy', 'començament', '#rusade', 'grandmother'.
### Key Observations
* The "Words in Cluster" column contains a mix of English words and words from other languages (Arabic, Russian, Spanish, etc.).
* Hashtags (#) are frequently present within the word lists, suggesting these words may have been extracted from social media or a similar platform.
* The clusters consistently contain the word "father" or related terms (padre, сын), indicating a strong thematic focus on familial relationships.
* The number of tokens per cluster varies, suggesting different levels of granularity or importance for each cluster.
### Interpretation
The data suggests an analysis of text related to the concept of "fatherhood" or family lineage. The presence of words in multiple languages indicates the source text is multilingual or deals with diverse cultural contexts. The hashtags suggest the data was likely scraped from a social media platform or a similar online source. The varying cluster sizes could reflect the frequency with which certain terms appear in relation to "father" or the degree to which they co-occur with it. The inclusion of seemingly unrelated words within some clusters (e.g., "Telegraph", "corona") might indicate noise in the clustering process or the presence of nuanced associations that require further investigation. The data could be used for sentiment analysis, topic modeling, or understanding cultural perceptions of fatherhood across different languages and communities.
</details>
Figure 2: Grouping of the token "father" at a few selected subspaces