# SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
**Authors**: Biao Zhang, Lixin Chen, Tong Liu (Hangzhou, China) and Bo Zheng (Beijing, China)
Abstract
Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
Biao Zhang, Lixin Chen, Tong Liu (Taobao & Tmall Group of Alibaba, Hangzhou, China) {zb372670,tianyou.clx,yingmu}@taobao.com; Bo Zheng (Taobao & Tmall Group of Alibaba, Beijing, China) bozheng@alibaba-inc.com
1 Introduction
<details>
<summary>figures/fig_intr.png Details</summary>

Line chart of NDCG@10 (y-axis, 0.4-0.9) versus embedding dimension (x-axis: 128, 256, 512, 1024, 1536, 3584) for four curves: LLM2Vec-7B, LLM2Vec-7B (w/ SMEC), LLM2Vec-1B, and LLM2Vec-1B (w/ SMEC). All curves rise with dimension, with diminishing returns at higher dimensions. The SMEC variants outperform their baselines at every dimension (e.g., at 128 dimensions: 0.772 vs. 0.568 for the 7B model and 0.718 vs. 0.492 for the 1B model; at 3584 dimensions: 0.862 vs. 0.802 and 0.802 vs. ~0.73, respectively). Annotations mark "~14x lossless dimension compression" for the 7B model and "~12x lossless dimension compression" for the 1B model.
</details>
Figure 1: The effectiveness of SMEC in dimensionality reduction. After customized training with the SMEC method on the BEIR Quora dataset, the embeddings of LLM2Vec-7B (3584 dimensions) and LLM2Vec-1B (1536 dimensions) achieve $14\times$ and $12\times$ lossless compression, respectively.
<details>
<summary>figures/overview.png Details</summary>

Diagram of three adaptor architectures, each starting from an Input fed through an Encoder that outputs an N-dimensional embedding. (a) Search-Adaptor: four parallel fully connected layers of sizes MxN, Mx(N/2), Mx(N/4), and Mx(N/8), each optimized with its own loss L(x_{1:N}), L(x_{1:N/2}), L(x_{1:N/4}), L(x_{1:N/8}). (b) Matryoshka-Adaptor: a single MxN fully connected layer whose output is truncated into nested prefixes, with the losses at all four dimensions combined into L(x). (c) SMEC (ours): a sequential design. Step 1 trains an MxN layer with losses L(x_{1:N}) and L(x_{1:N/2}) (combined as L1(x)); Step 2 trains an MxN/2 sub fully connected layer with L(x_{1:N/2}) and L(x_{1:N/4}) (combined as L2(x)); Step 3 trains an MxN/4 sub-sub fully connected layer with L(x_{1:N/4}) and L(x_{1:N/8}) (combined as L3(x)). Matryoshka dolls in the top-right corner illustrate the nested structure of (b) and (c).
</details>
Figure 2: Illustration of embedding compression architectures and our proposed approach. (a) presents the direct feature dimensionality reduction performed by Search-Adaptor using FC layers. (b) illustrates the Matryoshka-Adaptor, which employs a shared set of FC layers to generate low-dimensional embeddings at multiple output dimensions; a Matryoshka-like hierarchical inclusion relationship exists between the high- and low-dimensional embeddings. (c) presents our proposed Sequential Matryoshka Embedding Compression (SMEC) framework, which adopts a sequential approach to progressively reduce high-dimensional embeddings to the target dimension. The Matryoshka-doll illustration in the upper-right corner highlights the distinction between Matryoshka-Adaptor and SMEC.
Large language models excel in diverse text tasks due to their ability to capture nuanced linguistic structures and contextual dependencies. For instance, GPT-4 achieves state-of-the-art performance on benchmarks like GLUE Wang et al. (2018) and SuperGLUE Wang et al. (2019), demonstrating its proficiency in tasks such as natural language inference (NLI), question answering (QA), and text classification. This success is attributed to transformer-based architectures Vaswani et al. (2017), which enable parallel processing of sequential data and capture long-range dependencies through self-attention mechanisms. Llama-3 Grattafiori et al. (2024) and ChatGPT Brown et al. (2020) leverage the same principles to achieve comparable or superior performance in domain-specific and multilingual tasks.
LLMs are increasingly integrated into commercial information retrieval (IR) systems, such as search engines (e.g., Google’s MUM) and recommendation platforms (e.g., Netflix’s content retrieval). Their ability to generate embeddings for long documents (e.g., books, research papers) and dynamic queries (e.g., conversational search) makes them indispensable for modern applications. For example, the BEIR benchmark Thakur et al. (2021) evaluates cross-domain retrieval performance, where LLMs outperform traditional BM25 Robertson and Walker (1994) and BERT-based models Devlin et al. (2019) by leveraging contextual embeddings.
While LLMs' high-dimensional embeddings enable sophisticated semantic modeling, their storage and computational costs hinder scalability. Embedding dimensions of LLMs typically range from 1,024 (e.g., GPT-3) to 4,096 (e.g., Llama-3), exacerbating storage overhead and computational inefficiency, especially in real-time systems requiring dynamic updates. Moreover, high-dimensional vectors degrade the performance of retrieval algorithms due to the curse of dimensionality Beyer et al. (1999). For example, exact nearest-neighbor search in high-dimensional spaces becomes computationally infeasible, necessitating approximate methods like FAISS Johnson et al. (2017) or HNSW Yury et al. (2018). Even with these optimizations, query latency grows rapidly with dimensionality, limiting responsiveness in real-world applications.
To address these challenges, Matryoshka Representation Learning (MRL) Kusupati et al. (2022) encodes multi-scale information into a single embedding, balancing task complexity and efficiency. It achieves strong results in large-scale classification and retrieval tasks and has inspired variants like Matryoshka-Adaptor Yoon et al. (2024), which offers a scalable framework for transforming embeddings into structured representations with Matryoshka properties under both supervised and unsupervised settings. However, MRL's multi-scale parallel training strategy also limits its practical application in industry: whenever the retrieval system requires a new low-dimensional embedding, the model must be retrained from scratch to achieve effective dimensionality reduction.
In this paper, we systematically analyze the limitations of MRL and its variants in embedding compression and propose three key enhancements: (1) a continued-training-friendly training framework named Sequential Matryoshka Representation Learning (SMRL); (2) an adaptive dimension selection (ADS) mechanism to minimize information degradation during dimension pruning; and (3) a Selectable Cross-batch Memory (S-XBM) strategy to enhance unsupervised learning between high- and low-dimensional embeddings.
2 Related Work
2.1 Matryoshka representation learning
Matryoshka representation learning introduces a novel paradigm where embeddings are pretrained to inherently support progressive dimension truncation. This enables fine-grained control over the trade-off between computational latency (via reduced dimensionality) and accuracy (via retained semantic structure). Key innovations include the design of Matryoshka properties, such as hierarchical information encoding and intra-cluster compactness, which ensure that even truncated embeddings retain utility for downstream tasks.
In addition to representation learning, the concept of MRL has been applied to image generation, such as Matryoshka Diffusion Models (MDM) Gu et al. (2023); multimodal content understanding, such as $M^{3}$ Cai et al. (2024); and Multimodal Large Language Models (MLLMs), such as the Matryoshka Query Transformer (MQT) Hu et al. (2024).
2.2 Embedding Compression
Embedding compression aims to reduce the computational and memory footprint of neural network models or embeddings while preserving their utility for downstream tasks. This objective has driven research across multiple paradigms, each addressing different trade-offs between compression efficiency, performance retention, and adaptability. Early approaches primarily focused on unsupervised techniques based on linear algebra, such as Principal Component Analysis (PCA) Jolliffe and Cadima (2016), Linear Discriminant Analysis (LDA) McLachlan, and Non-negative Matrix Factorization (NMF) Lee and Seung (2000). Building upon these, autoencoders and their variants, such as Variational Autoencoders (VAEs) Kingma et al. (2013), have gradually emerged as powerful tools for nonlinear dimensionality reduction, capable of capturing complex data distributions. With the development of deep learning, methods such as Contrastive Predictive Coding (CPC) Oord et al. (2018) and Momentum Contrast (MoCo) He et al. (2020) have become capable of learning robust and compact representations from unlabeled data.
Recently, customized methods such as Search-Adaptor Yoon et al. (2023) and Matryoshka-Adaptor Yoon et al. (2024) have emerged as a new trend in embedding compression. They achieve significant dimensionality reduction by adding only a small number of parameters to the original representation model and retraining it on specific data.
3 Method
3.1 Rethinking MRL for embedding compression
MRL employs a nested-dimensional architecture to train models that learn hierarchical feature representations across multiple granularities. This allows adaptive deployment of models based on computational constraints. Specifically, MRL defines a series of models $f_{1},f_{2},...,f_{M}$ that share identical input and output spaces but progressively expand their hidden dimensions.
The term Matryoshka derives from the hierarchical parameter structure, where the parameters of model $f_{m}$ are nested within those of its successor $f_{m+1}$. To illustrate, consider an FC layer within the largest model $f_{M}$, which contains $d_{M}$ neurons in its hidden layer. Correspondingly, the FC layer of $f_{m}$ retains the first $d_{m}$ neurons of this structure, with dimensions satisfying $d_{1}\leq d_{2}\leq\dots\leq d_{M}$. MRL jointly trains these models using the following objective:
$$
\sum_{m=1}^{M}c_{m}\cdot\mathcal{L}(f_{m}(\mathbf{x});y), \tag{1}
$$
where $\mathcal{L}$ denotes the loss function, $y$ represents the ground-truth label, and $c_{m}$ are task-specific weighting coefficients. Notably, each training iteration requires forward and backward propagation for all $M$ models, resulting in substantial computational overhead compared to training a single standalone model. Upon convergence, MRL enables flexible inference by selecting any intermediate dimension $d_{i}\leq d_{M}$, thereby accommodating diverse computational constraints.
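Concretely, the joint objective of Equation 1 can be sketched in plain Python (a toy illustration, not the paper's training code; the cosine-based loss, the nesting dimensions, and the uniform weights $c_m$ are assumptions for the example):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mrl_loss(x, y, dims, weights):
    """Equation 1: sum_m c_m * L(f_m(x); y), where f_m truncates the
    embedding to its first d_m coordinates (the Matryoshka nesting)."""
    total = 0.0
    for d, c in zip(dims, weights):
        # L is a simple (1 - cosine) loss on the truncated prefix here;
        # any similarity-based loss could be substituted.
        total += c * (1.0 - cosine(x[:d], y[:d]))
    return total

# Toy example: a 16-dim embedding supervised at nested dims 4, 8, 16.
x = [math.sin(i + 1.0) for i in range(16)]
y = [math.cos(0.7 * i) for i in range(16)]
loss = mrl_loss(x, y, dims=[4, 8, 16], weights=[1.0, 1.0, 1.0])
```

Every forward pass evaluates the loss at all $M$ nesting dimensions, which is the source of the extra training cost noted above.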
Although the MRL method has partially mitigated the performance degradation of representations during dimensionality reduction, we contend that it still faces the following three unresolved issues:
Gradient Fluctuation. In large-scale vector retrieval systems, sample similarity is measured by the distance between representation vectors. Consequently, the optimization of embedding models typically employs loss functions based on embedding similarity. Under this condition, according to the derivation in Appendix A, the loss function $\mathcal{L}^{d}$ of MRL under dimension $d$ satisfies the following relationship with respect to the parameter $\mathbf{w}_{i}$ in the $i$-th dimension of the FC layer:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}\propto\frac{1}{\delta(d)^{2}}. \tag{2}
$$
Here, $\delta(d)$ is a complex function that is positively correlated with the dimension $d$. This equation provides a mathematical foundation for analyzing gradient fluctuations in multi-dimensional joint optimization architectures. It indicates that during MRL training, loss functions from different dimensions produce gradients of varying magnitudes on the same model parameter, thereby increasing gradient variance. In Section 5.2, we empirically demonstrate that this conclusion holds for different loss functions. We propose a solution to this problem in Section 3.2.
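This dimension dependence is easy to observe on a concrete similarity loss. The sketch below is our own toy construction (not part of the paper's derivation): it evaluates the analytic gradient of $\mathcal{L} = 1 - \cos(x_{1:d}, y_{1:d})$ coordinate by coordinate and shows the average magnitude shrinking as $d$ grows (here like $1/d$):

```python
import math

def avg_grad_magnitude(d):
    """Average |dL/dx_i| over i for L = 1 - cos(x[:d], y[:d]), with
    x = all-ones and y = alternating 1/0, so cos is constant in d."""
    x = [1.0] * d
    y = [1.0 if i % 2 == 0 else 0.0 for i in range(d)]
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    cos = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    grads = []
    for i in range(d):
        # d cos / d x_i = y_i / (|x||y|) - cos * x_i / |x|^2;
        # L = 1 - cos only flips the sign, not the magnitude.
        g = y[i] / (nx * ny) - cos * x[i] / (nx * nx)
        grads.append(abs(g))
    return sum(grads) / d

# Gradient magnitude grows as the dimension shrinks:
mags = {d: avg_grad_magnitude(d) for d in (4, 16, 64)}
```

In this construction the per-coordinate gradient magnitude works out to exactly $1/(\sqrt{2}\,d)$, so lower-dimensional losses contribute markedly larger gradients to the shared parameters, consistent with the trend stated in Equation 2.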
Information Degradation. Neural network parameters exhibit heterogeneous contributions to model performance, as demonstrated by the non-uniform distribution of their gradients and feature importance metrics Frankle and Carbin (2018). The MRL method employs a dimension truncation strategy (e.g., $D \to D/2 \to D/4 \to \dots$) to prune parameters and reduce feature dimensions by retaining partial parameters. However, this approach fails to adequately preserve critical parameters because it relies on a rigid, static truncation rule. Although MRL employs joint training of high- and low-dimensional vectors to redistribute information between truncated and retained parameters, this process is unavoidably accompanied by information degradation. Specifically, discarded parameters may contain essential information, such as unique feature mappings or high-order dependencies, that cannot be effectively recovered by the remaining ones. Empirical evidence, such as accuracy degradation and increased generalization gaps, demonstrates that such loss leads to suboptimal model performance and slower convergence Li et al. (2023). In summary, while MRL enables hierarchical dimensionality reduction, its inability to selectively retain critical parameters and the inherent information degradation during post-truncation training ultimately undermine its effectiveness in maintaining model performance. In Section 3.3, we propose a more effective dimension pruning method.
<details>
<summary>figures/fig_ads.png Details</summary>

Diagram of the dimension pruning process. A learnable parameter vector passes through a gumbel_softmax + top-k selection that picks important dimensions [1, 3, ..., X<sub>i-2</sub>, X<sub>i-1</sub>] of the input stack X<sub>1</sub> ... X<sub>i</sub> (shading indicates learned importance). One path keeps the full D-dimensional representation; the other applies dimension pruning to obtain a D/2-dimensional representation. Both representations are then used to compute a rank loss, allowing the impact of pruning to be compared.
</details>
Figure 3: The ADS module introduces a set of learnable parameters to dynamically select dimensions based on their importance during the dimensionality reduction process.
Sample Selection. The MRL framework employs supervised learning to jointly train high-dimensional ($D$) and low-dimensional ($D^{\prime}$) features. However, the number of available samples is limited by manual annotation. Matryoshka-Adaptor introduces in-batch sample mining strategies to expand the training sample scale, thereby addressing this inherent limitation. Specifically, it generates cross-sample pairs via the Cartesian product of batch samples:
$$
\mathcal{P}=\{(x_{i},x_{j})\mid x_{i},x_{j}\in\text{Batch},\ i\neq j\}. \tag{3}
$$
This approach creates $B(B-1)$ pairs per batch (where $B$ denotes the batch size), enabling cross-sample comparisons within large batches. However, this indiscriminate pairing introduces noise from non-representative or irrelevant sample pairs.
In light of this limitation, the method employs Top- $k$ similarity-based selection:
$$
\begin{split}\mathcal{P}_{\text{top-}k}&=\text{Top}_{k}\left(\text{similarity}(x_{i},x_{j})\right),\\
&\quad\forall\ (x_{i},x_{j})\in\mathcal{P}.\end{split} \tag{4}
$$
Here, only the top-$k$ most similar pairs are retained for training, reducing computational overhead while focusing on informative interactions. Despite this improvement, the diversity of effective samples remains fundamentally constrained by the original batch size $B$. In Section 3.4, we develop a strategy that empowers the model to mine global samples beyond the current batch.
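The pair-mining procedure of Equations 3 and 4 can be sketched as follows (a schematic in plain Python; the toy batch, the cosine similarity function, and $k=2$ are illustrative assumptions):

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(batch, k):
    """Eq. 3: all ordered cross-sample pairs (i != j), B(B-1) of them.
    Eq. 4: keep only the top-k most similar pairs."""
    pairs = list(permutations(range(len(batch)), 2))            # Eq. 3
    scored = sorted(pairs,
                    key=lambda p: cosine(batch[p[0]], batch[p[1]]),
                    reverse=True)
    return scored[:k]                                           # Eq. 4

batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top = mine_pairs(batch, k=2)
```

With $B = 4$ the Cartesian product yields $B(B-1) = 12$ candidate pairs, of which only the $k$ most similar survive; the pool of candidates, however, never extends beyond the current batch, which motivates the cross-batch memory introduced later.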
3.2 Sequential Matryoshka Representation Learning
Applying the conclusions from Section 3.1 to the MRL training process, take the parallel dimensionality reduction process $[D, D/2, D/4]$ as an example. The ratio of the average gradients for parameters $\mathbf{w}_{i}\ (i \in [0, D/4])$ and $\mathbf{w}_{j}\ (j \in [D/4, D/2])$ is as follows:
$$
\overline{\text{grad}_{i}}:\overline{\text{grad}_{j}}=\left(\frac{\partial\mathcal{L}^{D}}{\partial\mathbf{w}_{i}}+\frac{\partial\mathcal{L}^{D/2}}{\partial\mathbf{w}_{i}}+\frac{\partial\mathcal{L}^{D/4}}{\partial\mathbf{w}_{i}}\right):\left(\frac{\partial\mathcal{L}^{D}}{\partial\mathbf{w}_{j}}+\frac{\partial\mathcal{L}^{D/2}}{\partial\mathbf{w}_{j}}\right)\approx 1+\frac{\delta(D/2)^{2}}{\delta(D/4)^{2}}. \tag{5}
$$
As shown in Equation 5, the average gradient magnitude of parameter $\mathbf{w}_{i}$ can be approximated as $1+\frac{\delta(D/2)^{2}}{\delta(D/4)^{2}}$ times that of parameter $\mathbf{w}_{j}$, primarily due to the influence of the lower-dimensional loss function $\mathcal{L}^{D/4}$. To resolve this issue, we propose Sequential Matryoshka Representation Learning (SMRL), which substitutes the original parallel compression of embeddings with a sequential approach, as illustrated in Figure 2. Assume a dimensionality reduction trajectory of $[D, D/2, D/4, \dots, D/2^{n}]$. In each iteration, only the immediate transition (e.g., $D/2^{n-1} \to D/2^{n}$) is trained, avoiding the inclusion of lower-dimensional losses that amplify gradients for low-dimensional parameters. By eliminating this factor, the gradients of $\mathbf{w}_{i}\ (i \in [0, D/2^{n}])$ follow a consistent distribution with reduced variance, improving convergence speed and performance. Once the loss converges in the current iteration, the dimensionality reduction $D/2^{n-1} \to D/2^{n}$ is complete, and the process proceeds to the next stage $D/2^{n} \to D/2^{n+1}$, repeating the procedure until the target dimension is reached. Additionally, after convergence in one iteration, the optimal parameters for the current dimension are fixed to prevent subsequent reductions from degrading their performance. Notably, compared to MRL, the SMRL framework is more amenable to continued training. In scenarios where lower-dimensional retrieval embeddings (e.g., $D/8$) or intermediate embeddings (e.g., $D/3$) are required, these can be obtained through further dimensionality reduction training based on the already preserved $D/4$ or $D/2$ parameters, eliminating the need for retraining from scratch as is typically required in MRL.
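The sequential reduction loop can be sketched as follows (a structural sketch only: `make_projection` is a hypothetical stand-in for an FC stage that would be trained to convergence before being frozen; here it just draws random weights):

```python
import random

def make_projection(in_dim, out_dim, rng):
    """Hypothetical stand-in for one trainable FC reduction layer
    (out_dim x in_dim); real training would fit these weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]

def apply_projection(proj, x):
    """Matrix-vector product: project x down to len(proj) dimensions."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in proj]

def smrl_compress(x, target_dim, rng):
    """Sequentially halve the dimension D -> D/2 -> ... -> target_dim.
    After each stage 'converges', its parameters are frozen and the next
    stage trains only the immediate D/2^{n-1} -> D/2^n transition."""
    frozen_stages = []
    d = len(x)
    while d > target_dim:
        proj = make_projection(d, d // 2, rng)  # train_stage would run here
        frozen_stages.append(proj)              # freeze after convergence
        d //= 2
    out = x
    for proj in frozen_stages:
        out = apply_projection(proj, out)
    return out, [len(p) for p in frozen_stages]

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
z, trajectory = smrl_compress(x, target_dim=8, rng=rng)
```

Because every stage's parameters are preserved after convergence, a later request for an even lower dimension only requires training one additional stage on top of the frozen ones, rather than retraining from scratch.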
3.3 Adaptive Dimension Selection Module
Since directly truncating dimensions to obtain low-dimensional representations in MRL inevitably leads to information degradation, we propose the Adaptive Dimension Selection (ADS) module to dynamically identify important dimensions during training. As illustrated in Figure 3, we introduce a set of learnable parameters that represent the importance of different dimensions in the original representation $\mathbf{Z}\ (\text{dim}=D)$, and use these parameters to perform dimensional sampling, obtaining a reduced-dimension representation $\mathbf{Z}^{\prime}\ (\text{dim}=D/2)$. Since the sampling operation is non-differentiable, during the training phase we utilize the Gumbel-Softmax trick Jang et al. (2016) to approximate the importance of different dimensions. This is achieved by adding Gumbel-distributed noise $G\sim\text{Gumbel}(0,1)$ to the logit parameters $\hat{\mathbf{z}}$ of each dimension, followed by applying the softmax function to the perturbed logits to approximate the one-hot vector representing dimension selection. Mathematically, this can be expressed as:
$$
\mathbf{z}=\text{softmax}_{\tau}(\hat{\mathbf{z}}+G). \tag{6}
$$
Importantly, the Gumbel approximation allows the softmax scores of dimension importance to be interpreted as the probability of selecting each dimension, rather than enforcing a deterministic selection of the top- $k$ dimensions. This achieves a fully differentiable reparameterization, transforming the selection of embedding dimensions into an optimizable process.
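A minimal sketch of the relaxation in Equation 6 (plain Python; the logit values, the temperature $\tau = 0.5$, and keeping the top $D/2 = 4$ dimensions are illustrative assumptions):

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Eq. 6: softmax_tau(logits + G) with G ~ Gumbel(0, 1),
    where G = -log(-log(U)) for U ~ Uniform(0, 1)."""
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / tau) for v in noisy]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(42)
logits = [2.0, 0.5, -1.0, 0.0, 1.5, -0.5, 0.3, -2.0]  # learned importances z-hat
probs = gumbel_softmax(logits, tau=0.5, rng=rng)

# At inference, the D/2 most probable dimensions are kept (hard selection);
# during training the soft probabilities keep the selection differentiable.
keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:4]
```

Lowering $\tau$ makes the soft distribution approach a one-hot vector, while higher $\tau$ spreads probability mass across dimensions and eases optimization early in training.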
<details>
<summary>figures/fig3.png Details</summary>

Diagram of the S-XBM process flow. A frozen model produces embeddings that are enqueued into the XBM memory bank while the oldest entries are dequeued. Embeddings retrieved from the memory bank are multiplied with current-batch embeddings to form a score matrix, from which the top-k samples are selected. The selected samples pass through an FC layer and contribute to a pair-based loss.
</details>
Figure 4: S-XBM maintains a queue during training to store historical features across batches. Rather than incorporating all stored features into the current batch, it selectively leverages hard samples that exhibit high similarity to the current batch samples.
3.4 Selectable Cross-Batch Memory
A natural teacher-student relationship exists between the original embedding and its reduced-dimensional counterpart, making it feasible to improve the compressed embedding through unsupervised learning Yoon et al. (2024). However, as discussed in Section 3.1, performing this process within a single batch suffers from sample noise and insufficient diversity. To address this limitation, as illustrated in Figure 4, we propose the Selectable Cross-Batch Memory (S-XBM) module, which maintains a first-in-first-out (FIFO) queue during training to store original embeddings across batches. Unlike the original XBM Wang et al. (2020), we introduce two task-specific improvements: (1) retrieving only the top-$k$ most similar samples from the memory bank to construct new batches, and (2) deferring the trainable FC layer and storing only features generated by the frozen backbone, thereby avoiding feature drift. The unsupervised loss between the original embedding $emb$ and the low-dimensional embedding $emb[:d]$ is as follows:
$$
\mathcal{L}_{un\text{-}sup}=\sum_{i}\sum_{j\in\mathcal{N}_{K}(i)}\left|\text{Sim}(emb_{i},emb_{j})-\text{Sim}(emb_{i}[:d],emb_{j}[:d])\right| \tag{7}
$$
where $\mathcal{N}_{K}(i)$ denotes the set of the top-$k$ most similar embeddings to $emb_{i}$ within the S-XBM module.
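The S-XBM queue and the loss of Eq. 7 can be sketched as below. This is a simplified sketch, not the authors' implementation: the queue size, $k$, cosine similarity via dot products of L2-normalised vectors, and the un-renormalised truncated similarity are all illustrative assumptions.

```python
import numpy as np
from collections import deque

class SXBM:
    """Sketch of the S-XBM memory: a FIFO queue of frozen high-dimensional
    embeddings shared across batches. Queue size and k are illustrative."""

    def __init__(self, max_size=1024, k=3):
        self.queue = deque(maxlen=max_size)     # oldest features are dequeued
        self.k = k

    def enqueue(self, embs):
        for e in embs:                          # store frozen-backbone features
            self.queue.append(e)

    def unsup_loss(self, embs, d):
        """Eq. 7: match full-dim and truncated similarities over the top-k
        most similar (hardest) memory samples for each current embedding."""
        bank = np.stack(self.queue)             # (M, D)
        loss = 0.0
        for e in embs:
            sims = bank @ e                     # cosine sim for unit vectors
            for j in np.argsort(sims)[-self.k:]:    # N_K(i)
                full = float(bank[j] @ e)
                low = float(bank[j][:d] @ e[:d])    # truncated, not renormalised
                loss += abs(full - low)
        return loss / len(embs)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # L2-normalise embeddings
mem = SXBM(max_size=64, k=3)
mem.enqueue(x)
print(mem.unsup_loss(x[:4], d=8))
```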
<details>
<summary>figures/fig_openai.png Details</summary>

### Visual Description
## Line Chart: NDCG@10/% vs. Dimensions
### Overview
The image is a line chart comparing the performance of four different models (Original(MRL), search-adaptor, MRL-Adaptor, and SMEC) based on their NDCG@10/% scores across varying dimensions. The x-axis represents the dimensions, and the y-axis represents the NDCG@10/% score.
### Components/Axes
* **X-axis:** Dimensions, with values 128, 256, 512, 768, 1536, and 3072.
* **Y-axis:** NDCG@10/%, ranging from 50 to 62 with increments of 2.
* **Legend:** Located in the bottom-right corner, it identifies each model with a specific color and marker:
* Blue with circle marker: Original(MRL)
* Orange with cross marker: search-adaptor
* Green with triangle marker: MRL-Adaptor
* Yellow/Orange with square marker: SMEC
### Detailed Analysis
* **Original(MRL) (Blue, Circle):** The line starts at approximately 49.3 at dimension 128, increases to approximately 53.9 at 256, then to 55.5 at 512, 56.1 at 768, 56.6 at 1536, and finally reaches approximately 57 at 3072. The trend is generally upward, but the rate of increase slows down as dimensions increase.
* **search-adaptor (Orange, Cross):** The line starts at approximately 51.5 at dimension 128, increases to approximately 57.2 at 256, then to 59.3 at 512, 60.1 at 768, 61 at 1536, and finally reaches approximately 61.8 at 3072. The trend is upward, with a steeper initial increase.
* **MRL-Adaptor (Green, Triangle):** The line starts at approximately 54.6 at dimension 128, increases to approximately 58.4 at 256, then to 59.7 at 512, 60.4 at 768, 61.2 at 1536, and finally reaches approximately 61.8 at 3072. The trend is upward, with a moderate initial increase.
* **SMEC (Yellow/Orange, Square):** The line starts at approximately 56.7 at dimension 128, increases to approximately 59.4 at 256, then to 60.4 at 512, 60.8 at 768, 61.5 at 1536, and finally reaches approximately 61.9 at 3072. The trend is upward, with a relatively high initial value.
### Key Observations
* SMEC consistently performs better than the other models up to dimension 1536.
* The Original(MRL) model has the lowest NDCG@10/% scores across all dimensions.
* All models show diminishing returns as dimensions increase, with the curves flattening out.
* At higher dimensions (1536 and 3072), the performance of search-adaptor, MRL-Adaptor, and SMEC converges.
### Interpretation
The chart illustrates the impact of dimensionality on the performance of different models, as measured by NDCG@10/%. The Original(MRL) model appears to be less effective at higher dimensions compared to the other three models. The search-adaptor, MRL-Adaptor, and SMEC models show similar performance at higher dimensions, suggesting that increasing dimensions beyond a certain point may not significantly improve their performance. The diminishing returns observed across all models indicate that there may be an optimal dimensionality for these models, beyond which further increases provide minimal benefit.
</details>
(a) OpenAI text embeddings
<details>
<summary>figures/fig_llm2vec.png Details</summary>

### Visual Description
## Line Chart: NDCG@10/% vs. Dimensions
### Overview
The image is a line chart comparing the performance of four different methods (Original(PCA), Search-Adaptor, MRL-Adaptor, and SMEC) based on the metric NDCG@10/% across varying dimensions. The x-axis represents the number of dimensions, and the y-axis represents the NDCG@10/% score.
### Components/Axes
* **X-axis:** Dimensions, with values 128, 256, 512, 768, 1536, and 3584.
* **Y-axis:** NDCG@10/%, ranging from 30 to 60 with increments of 5.
* **Legend:** Located in the bottom-right corner, it identifies each line by color and method:
* Blue line with circle markers: Original(PCA)
* Orange line with cross markers: Search-Adaptor
* Green line with triangle markers: MRL-Adaptor
* Yellow line with square markers: SMEC
### Detailed Analysis
* **Original(PCA) (Blue):** The line starts at approximately 31 at 128 dimensions and increases steadily to approximately 56 at 3584 dimensions.
* (128, 31)
* (256, 41)
* (512, 46.5)
* (768, 50.5)
* (1536, 53.5)
* (3584, 55.5)
* **Search-Adaptor (Orange):** The line starts at approximately 51 at 128 dimensions, increases to approximately 57 at 256 dimensions, and then gradually increases to approximately 61 at 3584 dimensions.
* (128, 51)
* (256, 57)
* (512, 58.5)
* (768, 59)
* (1536, 60)
* (3584, 61)
* **MRL-Adaptor (Green):** The line starts at approximately 54 at 128 dimensions, increases to approximately 59 at 256 dimensions, and then gradually increases to approximately 61 at 3584 dimensions.
* (128, 54)
* (256, 59)
* (512, 60)
* (768, 60.5)
* (1536, 60.5)
* (3584, 61.5)
* **SMEC (Yellow):** The line starts at approximately 56 at 128 dimensions, increases to approximately 59 at 256 dimensions, and then gradually increases to approximately 61.5 at 3584 dimensions.
* (128, 56)
* (256, 59)
* (512, 60)
* (768, 60.5)
* (1536, 61)
* (3584, 61.5)
### Key Observations
* Original(PCA) has the lowest NDCG@10/% scores across all dimensions compared to the other three methods.
* Search-Adaptor, MRL-Adaptor, and SMEC perform similarly, with SMEC showing a slightly higher NDCG@10/% at higher dimensions.
* The performance of all methods generally improves as the number of dimensions increases, but the rate of improvement decreases at higher dimensions.
### Interpretation
The chart suggests that increasing the number of dimensions generally improves the performance of all methods, as measured by NDCG@10/%. However, the Original(PCA) method consistently underperforms compared to the other three methods. The Search-Adaptor, MRL-Adaptor, and SMEC methods achieve similar performance, with SMEC showing a slight advantage at higher dimensions. This indicates that these methods are more effective at capturing relevant information as the dimensionality increases, while Original(PCA) may be limited by its dimensionality reduction approach. The diminishing returns observed at higher dimensions suggest that there may be a point beyond which increasing dimensionality provides little additional benefit.
</details>
(b) LLM2Vec
Figure 5: Experimental results on the BEIR dataset comparing two models: OpenAI’s text-embedding-3-large (with 3072 dimensions) and LLM2Vec (with 3584 dimensions), the latter built upon the Qwen2-7B model. OpenAI text embeddings inherently contain multi-scale representations (enabled by MRL during pretraining), while LLM2Vec obtains its original low-dimensional representations via PCA.
4 Experiments
In this section, we compare our approach with state-of-the-art methods in the field of embedding dimensionality reduction.
4.1 Dataset Description
We evaluate the model’s retrieval performance across diverse datasets: BEIR Thakur et al. (2021) (text retrieval), Products-10K Bai et al. (2020) (image retrieval), and Fashion-200K Han et al. (2017) (cross-modal retrieval). BEIR is a comprehensive text retrieval benchmark consisting of 13 selected datasets from diverse domains. Products-10K contains approximately 10,000 products with over 150,000 images for large-scale product image retrieval. Fashion-200K includes over 200,000 fashion items with paired image-text data for cross-modal tasks.
4.2 Implementation Details
We use state-of-the-art models to extract the original embeddings for each dataset. Specifically, the BEIR dataset employs OpenAI text embeddings and LLM2Vec BehnamGhader et al. (2024) for text representation; the Products-10K dataset utilizes LLM2CLIP Huang et al. (2024) to obtain cross-modal embeddings; and the Fashion-200K dataset extracts image embeddings using the ViT-H Dosovitskiy et al. (2020) model. All dimensionality reduction methods operate on these original representations. To align with other methods, SMEC also adopts the rank loss Yoon et al. (2023) as its supervised loss function, defined as follows:
$$
\mathcal{L}_{rank}=\sum_{i}\sum_{j}\sum_{k}\sum_{m}I(y_{ij}>y_{ik})(y_{ij}-y_{ik})\log\left(1+\exp(s_{ik}[:m]-s_{ij}[:m])\right), \tag{8}
$$
where $I(y_{ij}>y_{ik})$ is an indicator function equal to 1 if $y_{ij}>y_{ik}$ and 0 otherwise, and $s_{ij}[:m]$ denotes the cosine similarity between the query embedding $emb_{i}[:m]$ and the corpus embedding $emb_{j}[:m]$. The total loss function is:
$$
\mathcal{L}_{total}=\mathcal{L}_{rank}+\alpha\mathcal{L}_{un\text{-}sup}, \tag{9}
$$
where $\alpha$ is a hyperparameter fixed to $1.0$. Since SMEC involves multi-stage training, the training epochs of the other methods are aligned with the total number of epochs consumed by SMEC, and their best performance is reported.
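For concreteness, the rank loss (Eq. 8) restricted to a single query at one truncation length, together with the total loss (Eq. 9), can be sketched as follows; the scores and relevance grades below are made-up illustrative values.

```python
import numpy as np

def rank_loss(scores, labels):
    """Pairwise rank loss (Eq. 8) for one query at a single truncation
    length m: `scores` are query-document cosine similarities s[:m],
    `labels` are graded relevances y."""
    loss = 0.0
    for j in range(len(scores)):
        for k in range(len(scores)):
            if labels[j] > labels[k]:           # indicator I(y_ij > y_ik)
                loss += (labels[j] - labels[k]) * np.log1p(
                    np.exp(scores[k] - scores[j]))
    return loss

def total_loss(l_rank, l_unsup, alpha=1.0):
    """Eq. 9 with the fixed weight alpha = 1.0."""
    return l_rank + alpha * l_unsup

y = np.array([2.0, 1.0, 0.0])                   # relevance grades
s_good = np.array([0.9, 0.5, 0.1])              # scores that agree with y
s_bad = s_good[::-1]                            # scores that invert the order
print(rank_loss(s_good, y) < rank_loss(s_bad, y))  # True
```

Pairs already ranked consistently with their relevance gap contribute little, while inverted pairs are penalised in proportion to that gap.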
4.3 Results
In this subsection, we report results on the BEIR, Fashion-200K, and Products-10K datasets. Retrieval performance is evaluated using the normalized discounted cumulative gain at rank 10 (nDCG@10) Kalervo et al. (2002) metric.
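A minimal sketch of the nDCG@10 metric, using linear gains (some implementations use exponential gains $2^{rel}-1$ instead):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """nDCG@k with linear gains: DCG of the system ranking divided by the
    DCG of the ideal (relevance-sorted) ranking. `relevances` lists the
    graded relevance of each retrieved document in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 3, 2, 2, 1, 0]))            # 1.0 (ideal ordering)
```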
BEIR. As shown in Figure 5, we compare the performance of SMEC and other state-of-the-art methods on two types of models, the API-based OpenAI text embedding and the open-source LLM2Vec, across various compressed dimensions. Notably, SMEC exhibits the strongest performance retention, particularly at smaller target dimensions. For example, when compressed to 128 dimensions, SMEC improves the performance of the OpenAI and LLM2Vec models by 1.9 and 1.1 points respectively, compared to the best-performing Matryoshka-Adaptor.
Products-10K. Images naturally contain denser features than text O Pinheiro et al. (2020). As shown in Figure 8(a) of Appendix C, SMEC surpasses other dimensionality reduction methods in image retrieval tasks, highlighting the effectiveness of the ADS module in mitigating information degradation during dimension pruning.
Fashion-200K. Unlike unimodal datasets, Fashion-200K involves cross-modal queries and documents, such as image-to-text and text-to-image retrieval. As illustrated in Figures 8(b) and 8(c) of Appendix C, SMEC achieves superior performance in both directions, demonstrating strong robustness in multimodal scenarios.
<details>
<summary>figures/fig_var.png Details</summary>

### Visual Description
## Line Chart: Gradient Variance vs. Epochs for MRL and SMRL
### Overview
The image is a line chart comparing the gradient variance of two methods, MRL and SMRL, over 40 epochs. The y-axis represents gradient variance on a logarithmic scale, and the x-axis represents the number of epochs. Both MRL and SMRL show a decreasing trend in gradient variance as the number of epochs increases.
### Components/Axes
* **X-axis:** Epochs, with markers at 0, 5, 10, 15, 20, 25, 30, 35, and 40.
* **Y-axis:** Gradient Variance, with a logarithmic scale. Markers are at 10<sup>-7</sup>, 10<sup>-8</sup>, and 10<sup>-9</sup>.
* **Legend (Top-Right):**
* Black line: MRL
* Cyan line: SMRL
### Detailed Analysis
* **MRL (Black Line):**
* Trend: Decreasing gradient variance over epochs.
* At epoch 0, gradient variance is approximately 2.0 x 10<sup>-7</sup>.
* At epoch 10, gradient variance is approximately 3.0 x 10<sup>-8</sup>.
* At epoch 20, gradient variance is approximately 6.0 x 10<sup>-9</sup>.
* At epoch 30, gradient variance is approximately 1.5 x 10<sup>-9</sup>.
* At epoch 40, gradient variance is approximately 1.0 x 10<sup>-9</sup>.
* **SMRL (Cyan Line):**
* Trend: Decreasing gradient variance over epochs.
* At epoch 0, gradient variance is approximately 8.0 x 10<sup>-8</sup>.
* At epoch 10, gradient variance is approximately 8.0 x 10<sup>-9</sup>.
* At epoch 20, gradient variance is approximately 2.0 x 10<sup>-9</sup>.
* At epoch 30, gradient variance is approximately 6.0 x 10<sup>-10</sup>.
* At epoch 40, gradient variance is approximately 4.0 x 10<sup>-10</sup>.
### Key Observations
* Both MRL and SMRL show a significant decrease in gradient variance during the first 10 epochs.
* SMRL consistently has a lower gradient variance than MRL across all epochs.
* The rate of decrease in gradient variance slows down after 20 epochs for both methods.
* Both lines appear to flatten out towards the end of the training (epochs 30-40).
### Interpretation
The chart suggests that both MRL and SMRL converge as the number of training epochs increases, indicated by the decreasing gradient variance. SMRL appears to converge faster and achieve a lower gradient variance than MRL, suggesting it might be a more stable or efficient optimization method for this particular task. The flattening of the curves towards the end indicates that both methods are approaching a point where further training yields diminishing returns in terms of reducing gradient variance. The lower gradient variance of SMRL suggests it might generalize better or be less prone to overfitting compared to MRL.
</details>
(a) Gradient Variance
<details>
<summary>figures/fig_loss.png Details</summary>

### Visual Description
## Line Chart: Loss vs. Epochs for MRL and SMRL
### Overview
The image is a line chart comparing the loss of two models, MRL and SMRL, over 40 epochs. Both models show a decreasing loss over time, with SMRL consistently exhibiting a lower loss than MRL. The chart includes a grid for easier reading and a legend in the top-right corner.
### Components/Axes
* **X-axis:** Epochs, ranging from 0 to 40 in increments of 5.
* **Y-axis:** Loss, ranging from 0.05 to 0.10 in increments of 0.01.
* **Legend (Top-Right):**
* MRL (Black line)
* SMRL (Cyan line)
### Detailed Analysis
* **MRL (Black Line):**
* Trend: The loss decreases sharply in the initial epochs and then gradually stabilizes.
* Data Points:
* Epoch 0: Approximately 0.102
* Epoch 5: Approximately 0.069
* Epoch 10: Approximately 0.062
* Epoch 15: Approximately 0.059
* Epoch 20: Approximately 0.059
* Epoch 25: Approximately 0.058
* Epoch 30: Approximately 0.056
* Epoch 35: Approximately 0.056
* Epoch 40: Approximately 0.056
* **SMRL (Cyan Line):**
* Trend: The loss decreases rapidly in the early epochs and then fluctuates slightly around a lower value.
* Data Points:
* Epoch 0: Approximately 0.090
* Epoch 5: Approximately 0.060
* Epoch 10: Approximately 0.054
* Epoch 15: Approximately 0.049
* Epoch 20: Approximately 0.049
* Epoch 25: Approximately 0.047
* Epoch 30: Approximately 0.047
* Epoch 35: Approximately 0.047
* Epoch 40: Approximately 0.047
### Key Observations
* Both MRL and SMRL models show a significant reduction in loss during the first few epochs.
* SMRL consistently outperforms MRL, achieving a lower loss across all epochs.
* The rate of loss reduction decreases over time for both models, indicating convergence.
* After approximately 20 epochs, the loss for both models fluctuates within a narrow range.
### Interpretation
The chart demonstrates the training performance of two models, MRL and SMRL, by tracking their loss over 40 epochs. The data suggests that SMRL is a more effective model, as it achieves a lower loss compared to MRL throughout the training process. The rapid initial decrease in loss for both models indicates effective learning, while the subsequent stabilization suggests that the models are approaching their optimal performance levels. The fluctuations in loss after 20 epochs could be due to the learning rate or inherent noise in the training data.
</details>
(b) Validation loss
<details>
<summary>figures/fig_ndcg.png Details</summary>

### Visual Description
## Line Chart: NDCG@10 vs Epochs for MRL and SMRL
### Overview
The image is a line chart comparing the performance of two models, MRL and SMRL, over 40 epochs. The y-axis represents NDCG@10, a metric for ranking quality, and the x-axis represents the number of epochs. The chart shows how the NDCG@10 score changes for each model as training progresses.
### Components/Axes
* **X-axis:** Epochs, ranging from 0 to 40 in increments of 5.
* **Y-axis:** NDCG@10, ranging from 0.41 to 0.47 in increments of 0.01.
* **Legend (top-left):**
* MRL (black line)
* SMRL (cyan line)
### Detailed Analysis
* **MRL (black line):**
* Trend: Initially decreases, then increases significantly, and finally plateaus.
* Data Points:
* Epoch 0: Approximately 0.442
* Epoch 10: Approximately 0.412
* Epoch 20: Approximately 0.426
* Epoch 30: Approximately 0.459
* Epoch 40: Approximately 0.458
* **SMRL (cyan line):**
* Trend: Decreases sharply, then increases significantly, and plateaus at a higher level than MRL.
* Data Points:
* Epoch 0: Approximately 0.440
* Epoch 10: Approximately 0.416
* Epoch 20: Approximately 0.434
* Epoch 30: Approximately 0.470
* Epoch 40: Approximately 0.472
### Key Observations
* Both models initially experience a decrease in NDCG@10.
* SMRL consistently outperforms MRL after approximately epoch 23.
* Both models plateau in performance after approximately epoch 30.
* SMRL reaches a higher peak NDCG@10 value than MRL.
### Interpretation
The chart suggests that both MRL and SMRL models require a certain number of training epochs before they start to improve their ranking performance. The initial decrease in NDCG@10 could be due to the models exploring the search space before converging on a better solution. SMRL appears to be a more effective model, as it achieves a higher NDCG@10 score and plateaus at a higher level than MRL. The plateauing of both models suggests that further training beyond 30 epochs may not significantly improve their performance. The difference in performance between MRL and SMRL could be due to differences in their architecture, training data, or hyperparameters.
</details>
(c) Retrieval performance
Figure 6: Analysis of metrics during the training process. (a) shows the gradient variance curve (with the vertical axis in a logarithmic scale), (b) presents the loss curve on the validation set, and (c) illustrates the performance variations on the test set. As training progresses, the gradient variances of both MRL and SMRL decrease; however, the gradient variance of MRL remains several times higher than that of SMRL. Consequently, the loss curve of SMRL converges more quickly to a lower value, and the compressed embedding demonstrates better retrieval performance.
5 Discussions
5.1 The influence of gradient variance
To validate the impact of gradient variance on convergence speed and model performance (as discussed in Section 3.2), we conducted comparative experiments between SMRL and MRL using the MiniLM model on the BEIR dataset. As shown in Figure 6(a), MRL consistently exhibits significantly higher gradient variance than SMRL throughout training. Consequently, the training loss of MRL continues to decline beyond the 20th epoch, whereas SMRL’s loss starts to converge at the 15th epoch. A similar trend is observed in subfigure 6(c), where SMRL enters the improvement phase earlier and converges to superior performance.
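The gradient-variance statistic tracked in Figure 6(a) can be sketched as below. The toy data merely illustrates, under our own assumptions rather than the paper's measurement, how mixing gradient terms of very different scales (as when all truncation losses are summed at once) inflates the variance relative to a single-scale sequential stage.

```python
import numpy as np

def gradient_variance(grads):
    """Variance of the per-parameter gradients, the quantity plotted in
    Figure 6(a). `grads` is a flat vector over all monitored parameters."""
    g = np.asarray(grads, dtype=float)
    return float(((g - g.mean()) ** 2).mean())

# Toy illustration (not the paper's measurement): summing losses over all
# truncation lengths at once mixes gradient terms of different scales on
# shared dimensions, while a sequential stage sees a single scale.
rng = np.random.default_rng(0)
mixed_scales = np.concatenate([rng.normal(0, s, 96) for s in (1.0, 0.2)])
single_scale = rng.normal(0, 0.2, 192)
print(gradient_variance(mixed_scales) > gradient_variance(single_scale))  # True
```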
<details>
<summary>figures/fig_discuss2_rank.png Details</summary>

### Visual Description
## Bar and Line Chart: Gradient Size and Variance vs. Epochs
### Overview
The image is a combination of a bar chart and a line chart, displaying the average gradient and gradient variance for two different models (SMRL and MRL) across epochs. The x-axis represents epochs (0, 10, 20, 30), while the left y-axis (Gradient Size) is on a logarithmic scale. The right y-axis (var) represents the gradient variance, also on a logarithmic scale. The chart compares the performance of SMRL and MRL models by showing how their average gradient and gradient variance change over training epochs.
### Components/Axes
* **X-axis:** Epochs, with markers at 0, 10, 20, and 30.
* **Left Y-axis:** Gradient Size, logarithmic scale from 10^-2 to 10^0 (0.01 to 1).
* **Right Y-axis:** var (Gradient Variance), logarithmic scale from 10^-9 to 10^-7.
* **Legend (top-right):**
* Light Blue: Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL)
* Blue: Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL)
* Light Orange: Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL)
* Orange: Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL)
* Blue Line with Circle Markers: Gradient Variance (ω<sub>k,k∈[0,192]</sub>, SMRL)
* Brown Line with Square Markers: Gradient Variance (ω<sub>k,k∈[0,192]</sub>, MRL)
### Detailed Analysis
#### Average Gradient (Bar Chart)
* **SMRL (ω<sub>i,i∈[0,96]</sub>, Light Blue):**
* Epoch 0: 1.198
* Epoch 10: 0.063
* Epoch 20: 0.024
* Epoch 30: 0.015
* Trend: Decreasing
* **SMRL (ω<sub>j,j∈[96,192]</sub>, Blue):**
* Epoch 0: 1.054
* Epoch 10: 0.062
* Epoch 20: 0.026
* Epoch 30: 0.013
* Trend: Decreasing
* **MRL (ω<sub>i,i∈[0,96]</sub>, Light Orange):**
* Epoch 0: 1.188
* Epoch 10: 0.184
* Epoch 20: 0.081
* Epoch 30: 0.038
* Trend: Decreasing
* **MRL (ω<sub>j,j∈[96,192]</sub>, Orange):**
* Epoch 0: 3.666
* Epoch 10: 0.062
* Epoch 20: 0.025
* Epoch 30: 0.013
* Trend: Decreasing
#### Gradient Variance (Line Chart)
* **SMRL (ω<sub>k,k∈[0,192]</sub>, Blue Line):**
* Epoch 0: 8.84e-8
* Epoch 10: 8.97e-9
* Epoch 20: 1.48e-9
* Epoch 30: 5.78e-10
* Trend: Decreasing
* **MRL (ω<sub>k,k∈[0,192]</sub>, Brown Line):**
* Epoch 0: 3.07e-7
* Epoch 10: 2.27e-8
* Epoch 20: 4.93e-9
* Epoch 30: 1.87e-9
* Trend: Decreasing
### Key Observations
* The average gradient for both SMRL and MRL decreases significantly over the epochs.
* The gradient variance for both SMRL and MRL also decreases over the epochs.
* The initial gradient size at epoch 0 is higher for MRL (ω<sub>j,j∈[96,192]</sub>) compared to other average gradients.
* The gradient variance for MRL is higher than SMRL at the beginning, but both converge to lower values.
### Interpretation
The chart illustrates the training behavior of SMRL and MRL models, showing how the average gradient and gradient variance change as the models are trained over epochs. The decreasing trend in both average gradient and gradient variance suggests that the models are learning and converging towards a stable state. The higher initial gradient variance for MRL might indicate a more unstable initial state or a different learning dynamic compared to SMRL. The convergence of both models to lower gradient variance values indicates that both models are becoming more stable as training progresses. The data suggests that both models are effectively learning, with the gradient size and variance decreasing as the number of epochs increases.
</details>
(a) Rank Loss
<details>
<summary>figures/fig_discuss2_mse.png Details</summary>

### Visual Description
## Chart: Gradient Size and Variance vs. Epochs
### Overview
The image is a combination bar and line chart that displays the average gradient and gradient variance for two different models (SMRL and MRL) across several epochs. The y-axis on the left represents "Gradient Size" on a logarithmic scale, while the y-axis on the right represents "var" (variance) also on a logarithmic scale. The x-axis represents "Epochs".
### Components/Axes
* **X-axis:** Epochs, with tick marks at 0, 10, 20, and 30.
* **Left Y-axis:** Gradient Size, logarithmic scale from 10^-1 to 10^0 (1).
* **Right Y-axis:** var (variance), logarithmic scale from 10^-8 to 10^-5.
* **Legend (top-right):**
* Light Blue: Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL)
* Blue: Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL)
* Light Orange: Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL)
* Orange: Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL)
* Blue Line with Circle Markers: Gradient Variance (ω<sub>k,k∈[0,192]</sub>, SMRL)
* Brown Line with Square Markers: Gradient Variance (ω<sub>k,k∈[0,192]</sub>, MRL)
### Detailed Analysis
**Average Gradient (Bar Plots):**
* **SMRL (ω<sub>i,i∈[0,96]</sub>, Light Blue):**
* Epoch 0: 1.124
* Epoch 10: 0.088
* Epoch 20: 0.039
* Epoch 30: 0.023
* **SMRL (ω<sub>j,j∈[96,192]</sub>, Blue):**
* Epoch 0: 1.037
* Epoch 10: 0.083
* Epoch 20: 0.04
* Epoch 30: 0.023
* **MRL (ω<sub>i,i∈[0,96]</sub>, Light Orange):**
* Epoch 0: 1.51e-5
* Epoch 10: 2.43e-7
* Epoch 20: 4.68e-8
* Epoch 30: 6.51e-9
* **MRL (ω<sub>j,j∈[96,192]</sub>, Orange):**
* Epoch 0: 1.093
* Epoch 10: 0.18
* Epoch 20: 0.077
* Epoch 30: 0.062
**Gradient Variance (Line Plots):**
* **SMRL (ω<sub>k,k∈[0,192]</sub>, Blue Line):** The line slopes downward.
* Epoch 0: 1.51e-5
* Epoch 10: 9.88e-8
* Epoch 20: 2.09e-8
* Epoch 30: 6.51e-9
* **MRL (ω<sub>k,k∈[0,192]</sub>, Brown Line):** The line slopes downward.
* Epoch 0: 5.32e-5
* Epoch 10: 2.717
* Epoch 20: 2.43e-7
* Epoch 30: 4.68e-8
### Key Observations
* Both average gradients (SMRL and MRL) decrease as the number of epochs increases.
* Both gradient variances (SMRL and MRL) decrease as the number of epochs increases.
* The SMRL average gradients are significantly larger than the MRL average gradients.
* The SMRL gradient variance is initially lower than the MRL gradient variance, but both converge to similar values at higher epochs.
### Interpretation
The chart illustrates the training behavior of two models, SMRL and MRL, by tracking the average gradient and gradient variance over epochs. The decreasing gradient size suggests that both models are learning and converging towards a minimum. The decreasing gradient variance indicates that the models are becoming more stable and consistent in their updates. The SMRL model has larger average gradients, which could indicate faster initial learning, but the MRL model's lower initial variance might suggest better stability. The convergence of both models' variances at higher epochs implies that they eventually reach a similar level of stability.
</details>
(b) MSE Loss
<details>
<summary>figures/fig_discuss2_ce.png Details</summary>

### Visual Description
## Bar and Line Chart: Gradient Size vs. Epochs
### Overview
The image is a combination of a bar chart and a line chart, displaying the average gradient and gradient variance over epochs for two different models (SMRL and MRL). The x-axis represents epochs, while the left y-axis represents gradient size (logarithmic scale) and the right y-axis represents variance (also logarithmic scale).
### Components/Axes
* **X-axis:** Epochs, labeled "Epochs", with tick marks at 0, 10, 20, and 30.
* **Left Y-axis:** Gradient Size, labeled "Gradient Size", with a logarithmic scale ranging from 10<sup>-1</sup> to 10<sup>0</sup> (0.1 to 1).
* **Right Y-axis:** Variance, labeled "var", with a logarithmic scale ranging from 10<sup>-7</sup> to 10<sup>-4</sup>.
* **Legend (top-right):**
* Light Blue: Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL)
* Blue: Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL)
* Light Orange: Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL)
* Orange: Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL)
* Blue Line with Circle Markers: Gradient Variance (ω<sub>k,k∈[0,192]</sub>, SMRL)
* Brown Line with Square Markers: Gradient Variance (ω<sub>k,k∈[0,192]</sub>, MRL)
### Detailed Analysis or ### Content Details
**Bar Chart Data (Average Gradients):**
* **Epoch 0:**
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL): 2.28
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL): 2.298
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL): 7.17e-5
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL): 2.381
* **Epoch 10:**
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL): 0.249
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL): 0.255
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL): 9.97e-7
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL): 0.528
* **Epoch 20:**
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL): 0.099
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL): 0.113
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL): 1.64e-7
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL): 0.311
* **Epoch 30:**
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, SMRL): 0.084
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, SMRL): 0.082
* Average Gradient (ω<sub>i,i∈[0,96]</sub>, MRL): 9.17e-8
* Average Gradient (ω<sub>j,j∈[96,192]</sub>, MRL): 0.257
**Line Chart Data (Gradient Variance):**
* **Gradient Variance (ω<sub>k,k∈[0,192]</sub>, SMRL) - Blue Line:**
* Epoch 0: 5.556e-4
* Epoch 10: 2.24e-6
* Epoch 20: 1.64e-7
* Epoch 30: 9.17e-8
The blue line slopes downward.
* **Gradient Variance (ω<sub>k,k∈[0,192]</sub>, MRL) - Brown Line:**
* Epoch 0: 2.11e-4
* Epoch 10: 5.28e-6
* Epoch 20: 8.12e-7
* Epoch 30: 4.99e-7
The brown line slopes downward.
### Key Observations
* The average gradient magnitudes of both SMRL and MRL decrease as training progresses.
* The gradient variance of both SMRL and MRL likewise decreases over epochs.
* For parameters in the range [0,96], average gradients are several orders of magnitude larger under SMRL than under MRL; for [96,192] the two methods are comparable.
* SMRL's gradient variance is higher than MRL's at epoch 0, but drops below it by epoch 10 and remains lower thereafter.
### Interpretation
The chart tracks the average gradient and gradient variance of the SMRL and MRL models over training epochs. The decreasing gradient magnitudes suggest that both models converge toward a stable solution, and the decreasing gradient variance indicates increasingly consistent updates. Under MRL, the parameters in [0,96] receive gradients that are orders of magnitude smaller than those in [96,192], whereas SMRL delivers similarly sized gradients to both groups; accordingly, after the first few epochs the gradient variance under SMRL falls below that of MRL and stays lower.
</details>
(c) CE Loss
Figure 7: Gradient statistics with Rank, MSE and CE loss (with the vertical axis in a logarithmic scale): average gradient magnitudes of parameters in the ranges $[0,96]$ and $[96,192]$ , as well as the gradient variance over all parameters in the range $[0,192]$ , during training.
5.2 Gradient variance of different loss functions
Section 5.1 demonstrates that MRL exhibits higher gradient variance than SMRL when rank loss is employed as the loss function, corroborating the findings presented in Section 3.2. To further validate this finding, we conducted additional experiments on the BEIR dataset using rank loss, MSE loss, and cross-entropy (CE) loss under identical settings. The results depicted in Figure 7 reveal a consistent pattern across all three loss functions, confirming the robustness of our conclusions.
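The per-epoch statistics plotted in Figure 7 (average gradient magnitude per index range, variance over the full range) can be computed with a small helper; a minimal sketch, with the function name and the default ranges $[0,96]$ and $[96,192]$ taken as illustrative:

```python
import numpy as np

def gradient_stats(grads, ranges=((0, 96), (96, 192))):
    """Summarize per-parameter gradients for one epoch.

    grads: 1-D sequence of gradients for the monitored parameters.
    Returns the average |gradient| over each index range and the
    variance of the gradients over the union of the ranges.
    """
    arr = np.asarray(grads, dtype=float)
    mags = np.abs(arr)
    averages = [float(mags[lo:hi].mean()) for lo, hi in ranges]
    full = arr[ranges[0][0]:ranges[-1][1]]   # e.g. indices [0, 192)
    return averages, float(full.var())
```

Logging these two quantities once per epoch reproduces the bar and line series of Figure 7.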
5.3 Ablation studies
To evaluate the contribution of each component of SMEC to the overall performance, we conduct ablation studies using MRL as the baseline. Different modules are incrementally added on top of MRL, as detailed in Table 1. When examined individually, the SMRL strategy achieves the most significant performance gain, suggesting that its reduced gradient variance contributes positively to model performance. The ADS and S-XBM modules also provide notable improvements. Combining all three components improves the performance of the 128-dimensional embedding by 3.1 points.
| NDCG@10 | 128 | 256 | 512 | 768 |
| --- | --- | --- | --- | --- |
| w/ SMRL | 0.3808 | 0.4621 | 0.4895 | 0.5283 |
| w/ ADS | 0.3765 | 0.4583 | 0.4863 | 0.5254 |
| w/ S-XBM | 0.3778 | 0.4583 | 0.4853 | 0.5256 |
| SMEC (Ours) | 0.4053 | 0.4848 | 0.5002 | 0.5459 |
Table 1: Ablation studies of SMEC on 8 BEIR datasets with MRL as the baseline.
5.4 The contribution of ADS in preserving key information
The selection of important parameters in neural networks is a well-established research area, and numerous studies have demonstrated that network parameters are often redundant. As a result, parameter pruning has been widely adopted for model compression. Although ADS (and, more generally, the MEC family of methods) focuses on dimension selection within embeddings, it is fundamentally implemented through the selection of network parameters. ADS can therefore be regarded as a form of pruning with theoretical feasibility.
To fully demonstrate the effectiveness of ADS, we evaluate the dimension selection strategies of both ADS and MRL using WARE (Weighted Average Reconstruction Error) Yu et al. (2018), a metric commonly used in the pruning literature for assessing parameter importance. WARE is defined as follows:
$$
\text{WARE}=\frac{1}{M}\sum_{m=1}^{M}\frac{|\hat{y}_{m}-y_{m}|}{|y_{m}|} \tag{10}
$$
where $M$ denotes the number of samples; $\hat{y}_{m}$ and $y_{m}$ represent the model’s score (which can be interpreted as the similarity between an embedding pair) for the $m$-th sample before and after dimension pruning, respectively. The core idea of WARE is to quantify the change in the model’s output induced by removing a specific dimension; a larger change indicates higher importance of that dimension.
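A per-dimension WARE estimate can be sketched as follows, assuming cosine similarity as the score and zeroing one dimension at a time as the pruning operation (the function name, and using the unpruned score in the denominator as is conventional in the pruning literature, are our assumptions):

```python
import numpy as np

def ware_per_dimension(q, d):
    """Estimate WARE (Eq. 10) for each embedding dimension.

    q, d: (M, n) arrays of paired query/document embeddings.
    For dimension k, the pruned score of sample m is the cosine
    similarity of the pair with dimension k zeroed out.
    """
    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return np.sum(a * b, axis=1)

    base = cosine(q, d)                  # scores before pruning
    n = q.shape[1]
    scores = np.empty(n)
    for k in range(n):
        q_k, d_k = q.copy(), d.copy()
        q_k[:, k] = 0.0                  # prune dimension k
        d_k[:, k] = 0.0
        pruned = cosine(q_k, d_k)        # scores after pruning
        scores[k] = np.mean(np.abs(pruned - base) / np.abs(base))
    return scores
```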
We randomly sampled 10,000 instances from multiple sub-datasets of BEIR. For the LLM2Vec embeddings (3,072 dimensions), we computed the WARE of each dimension. We then used ADS and MRL to generate low-dimensional embeddings of 1536, 768, and 256 dimensions, respectively. For each method and compression level, we calculated the achievement rate, defined as the proportion of selected dimensions that appear among the top-N most important dimensions according to the WARE-based ranking.
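The achievement rate follows directly from a WARE-based ranking; a minimal sketch (`achievement_rate` and its argument names are illustrative):

```python
import numpy as np

def achievement_rate(selected, ware_scores):
    """Fraction of `selected` dimension indices that fall within the
    top-N dimensions ranked by WARE, where N = len(selected)."""
    n = len(selected)
    top_n = set(np.argsort(ware_scores)[::-1][:n])  # N highest-WARE dims
    return len(top_n & set(selected)) / n
```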
| Dimension | ADS (Dimension Selection) | MRL (Dimension Truncation) |
| --- | --- | --- |
| 1536 | 94.3% | 50.3% |
| 768 | 90.1% | 32.8% |
| 256 | 83.6% | 17.4% |
Table 2: Achievement Rate of Important Dimension Selection at Different Dimension Levels.
The results in Table 2 show that the achievement rate of MRL is roughly linear in the compression ratio, indicating that the importance of dimensions has no strong correlation with their positions. The achievement rate of ADS also decreases as the number of retained dimensions shrinks, owing to the increased difficulty of selecting the top-N most important dimensions under higher compression ratios. However, even when compressed by a factor of 6, ADS still selects over 80% of the most important dimensions. This explains why, as seen in Figure 5, SMEC demonstrates stronger performance at lower dimensions.
5.5 Memory size of S-XBM
In this subsection, we explore how the memory size of the S-XBM module affects training speed and model performance. Theoretically, a larger memory makes it easier for the S-XBM module to mine more hard samples, thereby improving model performance. However, an excessively large memory increases the retrieval time for top-k samples, which degrades training efficiency. To verify this experimentally, we train the SMEC framework with memory sizes of 1000, 2000, 5000, 10000, and 15000, as shown in Table 3. The results demonstrate a clear trade-off between training speed and model performance. We select a memory size of 5000 as our final choice to balance the two.
| Memory Size | 1000 | 2000 | 5000 | 10000 | 15000 |
| --- | --- | --- | --- | --- | --- |
| Forward Time/s $\downarrow$ | 0.06 | 0.08 | 0.11 | 0.15 | 0.21 |
| NDCG@10 $\uparrow$ | 0.4631 | 0.4652 | 0.4675 | 0.4682 | 0.4689 |
Table 3: Trade-off analysis of training speed and model performance under different memory sizes of S-XBM.
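The trade-off in Table 3 stems from the top-k retrieval over a growing memory bank. A minimal sketch of such a selectable cross-batch memory, assuming a FIFO bank and brute-force cosine search (class and method names are illustrative, not the paper's implementation):

```python
import numpy as np

class SelectableXBM:
    """FIFO memory of past embeddings with top-k mining (sketch)."""

    def __init__(self, size, dim):
        self.size = size                       # memory size, e.g. 5000
        self.bank = np.empty((0, dim))

    def enqueue(self, embeddings):
        # Append the newest batch and drop the oldest entries.
        self.bank = np.concatenate([self.bank, embeddings])[-self.size:]

    def topk(self, queries, k):
        # Mine the k most similar memory entries per query (cosine).
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        m = self.bank / np.linalg.norm(self.bank, axis=1, keepdims=True)
        sim = q @ m.T                          # (batch, memory)
        k = min(k, len(self.bank))
        idx = np.argsort(-sim, axis=1)[:, :k]  # larger memory -> slower sort
        return self.bank[idx]                  # (batch, k, dim)
```

The `argsort` over the full bank is what grows with memory size, matching the forward-time trend in Table 3.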
6 Conclusions
Although high-dimensional embeddings from large language models (LLMs) capture rich semantic features, their practical use is often limited by computational efficiency and storage constraints. To mitigate these limitations, this paper proposes the Sequential Matryoshka Embedding Compression (SMEC) framework for efficient embedding compression. SMEC comprises the Sequential Matryoshka Representation Learning (SMRL) module, which mitigates gradient variance during training; the Adaptive Dimension Selection (ADS) module, which minimizes information degradation during feature compression; and the Selectable Cross-batch Memory (S-XBM) module, which enhances unsupervised learning between high- and low-dimensional embeddings. Compared with existing approaches, our approach preserves higher performance at the same compression rate.
Limitations
The SMEC framework introduces only a small number of additional parameters on top of a pre-trained model and is trained using labeled data from a specific domain, along with mined hard samples, with the aim of reducing the dimensionality of the original embeddings. However, this design and objective limit its generalizability and applicability to broader scenarios. Future work could explore extending the SMEC approach to full-parameter training of representation models, enabling them to directly generate embeddings of multiple dimensions. Additionally, the feasibility of training the model on diverse datasets is also worth investigating.
References
- (1) Openai text embedding. https://platform.openai.com/docs/guides/embeddings/embeddings.
- Bai et al. (2020) Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, and Wei Zhang. 2020. Products-10k: A large-scale product recognition dataset.
- BehnamGhader et al. (2024) Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961.
- Beyer et al. (1999) K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. 1999. When is nearest neighbor meaningful. Springer.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Cai et al. (2024) Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. 2024. Matryoshka multimodal models. In Workshop on Video-Language Models@ NeurIPS 2024.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and 1 others. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Gu et al. (2023) Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. 2023. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations.
- Han et al. (2017) Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, and Larry S. Davis. 2017. Automatic spatially-aware fashion concept discovery. IEEE.
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738.
- Hu et al. (2024) Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. 2024. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems, 37:50168–50188.
- Huang et al. (2024) Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, and Lili Qiu. 2024. Llm2clip: Powerful language model unlock richer visual representation. Preprint, arXiv:2411.04997.
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with gpus.
- Jolliffe and Cadima (2016) Ian T. Jolliffe and Jorge Cadima. 2016. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng, 374(2065):20150202.
- Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.
- Kingma et al. (2013) Diederik P Kingma, Max Welling, and 1 others. 2013. Auto-encoding variational bayes.
- Kusupati et al. (2022) Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and 1 others. 2022. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249.
- Lee and Seung (2000) Daniel Lee and H Sebastian Seung. 2000. Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 13.
- Li et al. (2023) Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. Losparse: Structured compression of large language models based on low-rank and sparse approximation. In International Conference on Machine Learning, pages 20336–20350. PMLR.
- (24) Geoffrey J. McLachlan. Discriminant analysis and statistical pattern recognition. Wiley-Interscience.
- O Pinheiro et al. (2020) Pedro O O Pinheiro, Amjad Almahairi, Ryan Benmalek, Florian Golemo, and Aaron C Courville. 2020. Unsupervised learning of dense visual representations. Advances in neural information processing systems, 33:4489–4500.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Robertson and Walker (1994) Stephen E. Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. ACM.
- Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding.
- Wang et al. (2020) Xun Wang, Haozhi Zhang, Weilin Huang, and Matthew R Scott. 2020. Cross-batch memory for embedding learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6388–6397.
- Yoon et al. (2023) Jinsung Yoon, Sercan O Arik, Yanfei Chen, and Tomas Pfister. 2023. Search-adaptor: Embedding customization for information retrieval. arXiv preprint arXiv:2310.08750.
- Yoon et al. (2024) Jinsung Yoon, Rajarishi Sinha, Sercan O Arik, and Tomas Pfister. 2024. Matryoshka-adaptor: Unsupervised and supervised tuning for smaller embedding dimensions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10318–10336, Miami, Florida, USA. Association for Computational Linguistics.
- Yu et al. (2018) Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. 2018. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9194–9203.
- Malkov and Yashunin (2018) Yury A. Malkov and Dmitry A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Appendix A Derivation of the Gradient Fluctuation
To formalize this issue, we analyze the Mean Squared Error (MSE) loss as a representative case. Let $\mathbf{x}_{1}=[x_{1},x_{2},...,x_{n}]^{\top}\in\mathbb{R}^{n}$ and $\mathbf{x}_{2}=[y_{1},y_{2},...,y_{n}]^{\top}\in\mathbb{R}^{n}$ denote two input feature vectors. The final FC layer employs a matrix $\mathbf{W}=[\mathbf{w}_{1},\mathbf{w}_{2},...,\mathbf{w}_{m}]^{\top}\in\mathbb{R}^{m\times n}$ to generate outputs $\mathbf{y}_{1}=\mathbf{W}\mathbf{x}_{1}\in\mathbb{R}^{m}$ and $\mathbf{y}_{2}=\mathbf{W}\mathbf{x}_{2}\in\mathbb{R}^{m}$. The MSE loss at dimension $d$ is defined as:
$$
\mathcal{L}^{d}=\left[\mathcal{Y}_{label}-sim(\mathbf{y}^{d}_{1},\mathbf{y}^{d}_{2})\right]^{2}, \tag{11}
$$
where $\mathcal{Y}_{label}$ denotes the binary classification label for pairs (0 or 1), and $sim(·)$ represents the normalized similarity of the learned representations.
According to the chain rule, the partial derivative of $\mathcal{L}^{d}$ with respect to the $i$ -th dimension parameter of the FC layer is derived as:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}=\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}\cdot\frac{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}{\partial\mathbf{w}_{i}}+\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}\cdot\frac{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}{\partial\mathbf{w}_{i}}. \tag{12}
$$
Utilizing cosine similarity (clamped to $[0,1]$) as the similarity function $sim(\cdot)$, Equation 11 can be rewritten as:
$$
\mathcal{L}^{d}=\left[\mathcal{Y}_{label}-\frac{{\mathbf{y}^{d}_{1}}^{\top}{\mathbf{y}^{d}_{2}}}{\|\mathbf{y}^{d}_{1}\|\|\mathbf{y}^{d}_{2}\|}\right]^{2}. \tag{13}
$$
Let $\|\mathbf{y}^{d}_{1}\|=A$, $\|\mathbf{y}^{d}_{2}\|=B$, ${\mathbf{y}^{d}_{1}}^{\top}\mathbf{y}^{d}_{2}=C$ and $s=\frac{C}{AB}$. The partial derivatives of $\mathcal{L}^{d}$ with respect to $\left[\mathbf{y}^{d}_{1}\right]_{i}$ and $\left[\mathbf{y}^{d}_{2}\right]_{i}$ are given as follows:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}=2\left(s-\mathcal{Y}_{label}\right)\left(\frac{\left[\mathbf{y}^{d}_{2}\right]_{i}}{AB}-\frac{s}{A^{2}}\left[\mathbf{y}^{d}_{1}\right]_{i}\right), \tag{14}
$$
$$
\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}=2\left(s-\mathcal{Y}_{label}\right)\left(\frac{\left[\mathbf{y}^{d}_{1}\right]_{i}}{AB}-\frac{s}{B^{2}}\left[\mathbf{y}^{d}_{2}\right]_{i}\right). \tag{15}
$$
Substituting $\left[\mathbf{y}^{d}_{1}\right]_{i}=\mathbf{w}_{i}\mathbf{x}_{1}$ and $\left[\mathbf{y}^{d}_{2}\right]_{i}=\mathbf{w}_{i}\mathbf{x}_{2}$ , the partial derivatives of the $\left[\mathbf{y}^{d}_{1}\right]_{i}$ and $\left[\mathbf{y}^{d}_{2}\right]_{i}$ with respect to $\mathbf{w}_{i}$ are given as follows:
$$
\frac{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}{\partial\mathbf{w}_{i}}=\mathbf{x}_{1},\frac{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}{\partial\mathbf{w}_{i}}=\mathbf{x}_{2}. \tag{16}
$$
Based on the above equations, the partial derivative of $\mathcal{L}^{d}$ with respect to $\mathbf{w}_{i}$ is derived as:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}=2\left(s-\mathcal{Y}_{label}\right)\left[\left(\frac{\left[\mathbf{y}^{d}_{2}\right]_{i}}{AB}-\frac{s}{A^{2}}\left[\mathbf{y}^{d}_{1}\right]_{i}\right)\mathbf{x}_{1}+\left(\frac{\left[\mathbf{y}^{d}_{1}\right]_{i}}{AB}-\frac{s}{B^{2}}\left[\mathbf{y}^{d}_{2}\right]_{i}\right)\mathbf{x}_{2}\right]. \tag{17}
$$
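Equation 17 can be checked numerically against central finite differences of Equation 13; a minimal sketch with random data, assuming the cosine similarity is left unclamped:

```python
import numpy as np

# Random problem instance: W in R^{m x n}, inputs x1, x2, label Y.
rng = np.random.default_rng(0)
n, m = 5, 4
W = rng.normal(size=(m, n))
x1, x2 = rng.normal(size=n), rng.normal(size=n)
Y = 1.0

def loss(W):
    """Eq. 13: squared error between the label and cos(y1, y2)."""
    y1, y2 = W @ x1, W @ x2
    A, B = np.linalg.norm(y1), np.linalg.norm(y2)
    return (Y - y1 @ y2 / (A * B)) ** 2

def grad_row(W, i):
    """Eq. 17: analytic gradient of the loss w.r.t. row w_i of W."""
    y1, y2 = W @ x1, W @ x2
    A, B = np.linalg.norm(y1), np.linalg.norm(y2)
    s = y1 @ y2 / (A * B)
    return 2 * (s - Y) * ((y2[i] / (A * B) - s * y1[i] / A**2) * x1
                          + (y1[i] / (A * B) - s * y2[i] / B**2) * x2)

def fd_row(W, i, eps=1e-6):
    """Central finite differences of the loss w.r.t. row w_i."""
    g = np.zeros(n)
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        g[j] = (loss(Wp) - loss(Wm)) / (2 * eps)
    return g
```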
Assume that $A$ and $B$ can be approximated by $\delta(d)\cdot a$ and $\delta(d)\cdot b$, respectively, where $\delta(d)$ models how the magnitude of the vectors $\mathbf{x}$ and $\mathbf{y}$ grows with $d$ (this correlation is evidently positive). Under this approximation, Equation 17 can be rewritten as:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}=2\left(s-\mathcal{Y}_{label}\right)\frac{1}{\delta(d)^{2}}\Bigg[\left(\frac{\left[\mathbf{y}^{d}_{2}\right]_{i}}{ab}-\frac{s}{a^{2}}\left[\mathbf{y}^{d}_{1}\right]_{i}\right)\mathbf{x}_{1}+\left(\frac{\left[\mathbf{y}^{d}_{1}\right]_{i}}{ab}-\frac{s}{b^{2}}\left[\mathbf{y}^{d}_{2}\right]_{i}\right)\mathbf{x}_{2}\Bigg]. \tag{18}
$$
In Equation 18, $a$, $b$, $\left[\mathbf{y}^{d}_{1}\right]_{i}$, and $\left[\mathbf{y}^{d}_{2}\right]_{i}$ are constants, $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ are constant vectors, and $s$ and $\mathcal{Y}_{label}$ are invariant with respect to the index $d$. Therefore, we can conclude the following:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}\propto\frac{1}{\delta(d)^{2}}. \tag{19}
$$
In theory, this rule can also be extended to other pair-wise similarity-based functions, such as rank loss, which is experimentally verified in Section 5.2.
Appendix B Results on BEIR Sub-datasets.
We compare the performance of different compression methods on several representative sub-datasets of BEIR, and the results are shown in Table 4.
| NDCG@10 | 128 | 256 | 512 | 768 | 1536 | 3072 |
| --- | --- | --- | --- | --- | --- | --- |
| Sub-dataset–SciFact | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.787 |
| w/ Search-Adaptor | 0.806 | 0.845 | 0.864 | 0.879 | 0.886 | 0.884 |
| w/ MRL-Adaptor | 0.826 | 0.861 | 0.876 | 0.880 | 0.886 | 0.887 |
| w/ SMEC (ours) | 0.841 | 0.874 | 0.879 | 0.882 | 0.885 | 0.886 |
| Sub-dataset–FiQA | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.498 |
| w/ Search-Adaptor | 0.475 | 0.505 | 0.529 | 0.540 | 0.545 | 0.550 |
| w/ MRL-Adaptor | 0.496 | 0.523 | 0.534 | 0.543 | 0.547 | 0.550 |
| w/ SMEC (ours) | 0.521 | 0.533 | 0.540 | 0.546 | 0.549 | 0.551 |
| Sub-dataset–Quora | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.775 |
| w/ Search-Adaptor | 0.771 | 0.805 | 0.830 | 0.845 | 0.861 | 0.864 |
| w/ MRL-Adaptor | 0.784 | 0.812 | 0.834 | 0.847 | 0.862 | 0.863 |
| w/ SMEC (ours) | 0.794 | 0.818 | 0.839 | 0.850 | 0.862 | 0.865 |
| Sub-dataset–NFCorpus | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.389 |
| w/ Search-Adaptor | 0.345 | 0.375 | 0.396 | 0.412 | 0.425 | 0.426 |
| w/ MRL-Adaptor | 0.364 | 0.384 | 0.403 | 0.419 | 0.426 | 0.427 |
| w/ SMEC (ours) | 0.389 | 0.402 | 0.418 | 0.426 | 0.430 | 0.431 |
| Sub-dataset–SciDocs | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.232 |
| w/ Search-Adaptor | 0.204 | 0.225 | 0.245 | 0.250 | 0.258 | 0.263 |
| w/ MRL-Adaptor | 0.220 | 0.240 | 0.250 | 0.255 | 0.262 | 0.265 |
| w/ SMEC (ours) | 0.239 | 0.246 | 0.251 | 0.255 | 0.261 | 0.264 |
Table 4: Comparison of retrieval performance on 5 BEIR sub-datasets.
Appendix C Experimental results on Products-10K and Fashion-200k.
<details>
<summary>figures/fig_img.png Details</summary>

### Visual Description
## Line Chart: NDCG@10/% vs. Dimensions for Different Algorithms
### Overview
The image is a line chart comparing the performance of four different algorithms (PCA, Search-Adaptor, MRL-Adaptor, and SMEC) based on their NDCG@10/% score across varying dimensions. The x-axis represents the number of dimensions, and the y-axis represents the NDCG@10/% score.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:**
* Label: "Dimensions"
* Scale: 128, 256, 512, 768, 1024
* **Y-axis:**
* Label: "NDCG@10/%"
* Scale: 40, 42, 44, 46, 48, 50, 52
* **Legend:** Located in the bottom-right corner.
* PCA (Blue line with circle markers)
* Search-Adaptor (Orange line with cross markers)
* MRL-Adaptor (Green line with triangle markers)
* SMEC (Yellow/Orange line with square markers)
### Detailed Analysis
* **PCA (Blue):** The line starts at approximately 40 at 128 dimensions, increases to approximately 46 at 256 dimensions, reaches approximately 47.3 at 512 dimensions, then increases to approximately 48.1 at 768 dimensions, and ends at approximately 48.6 at 1024 dimensions. The trend is upward, with the most significant increase between 128 and 256 dimensions.
* **Search-Adaptor (Orange):** The line starts at approximately 46.5 at 128 dimensions, increases to approximately 49.8 at 256 dimensions, reaches approximately 51.5 at 512 dimensions, then increases to approximately 52.3 at 768 dimensions, and ends at approximately 52.6 at 1024 dimensions. The trend is upward, with a decreasing rate of increase as dimensions increase.
* **MRL-Adaptor (Green):** The line starts at approximately 49 at 128 dimensions, increases to approximately 50.3 at 256 dimensions, reaches approximately 51.8 at 512 dimensions, then increases to approximately 52.4 at 768 dimensions, and ends at approximately 52.6 at 1024 dimensions. The trend is upward, with a decreasing rate of increase as dimensions increase.
* **SMEC (Yellow/Orange):** The line starts at approximately 49.8 at 128 dimensions, increases to approximately 51.3 at 256 dimensions, reaches approximately 52 at 512 dimensions, then increases to approximately 52.5 at 768 dimensions, and ends at approximately 52.7 at 1024 dimensions. The trend is upward, with a decreasing rate of increase as dimensions increase.
### Key Observations
* SMEC and MRL-Adaptor consistently outperform PCA and Search-Adaptor across all dimensions.
* The performance of all algorithms generally improves as the number of dimensions increases, but the rate of improvement decreases at higher dimensions.
* PCA has the lowest NDCG@10/% score across all dimensions.
* The performance difference between MRL-Adaptor and SMEC is minimal, especially at higher dimensions.
### Interpretation
The chart suggests that increasing the number of dimensions generally improves the performance of these algorithms, as measured by NDCG@10/%. However, there are diminishing returns, as the rate of improvement decreases at higher dimensions. The choice of algorithm significantly impacts performance, with SMEC and MRL-Adaptor consistently outperforming PCA and Search-Adaptor. PCA shows the most significant performance gain from 128 to 256 dimensions, indicating that increasing dimensions is particularly beneficial for this algorithm in the lower range. The close performance of MRL-Adaptor and SMEC suggests they may have similar underlying mechanisms or be optimized for similar data characteristics.
</details>
(a) Image retrieval
<details>
<summary>figures/fig_t2i.png Details</summary>

### Visual Description
## Line Chart: NDCG@10/% vs. Dimensions for Different Methods
### Overview
The image is a line chart comparing the performance of four different methods (PCA, Search-Adaptor, MRL-Adaptor, and SMEC) based on their NDCG@10/% scores across varying dimensions. The x-axis represents the number of dimensions, and the y-axis represents the NDCG@10/% score.
### Components/Axes
* **X-axis:** Dimensions, with markers at 128, 256, 384, 512, and 768.
* **Y-axis:** NDCG@10/%, with markers at 14, 16, 18, 20, 22, and 24.
* **Legend (bottom-right):**
* Blue line with circle marker: PCA
* Orange line with cross marker: Search-Adaptor
* Green line with triangle marker: MRL-Adaptor
* Yellow line with square marker: SMEC
### Detailed Analysis
* **PCA (Blue):** The PCA line starts at approximately 13.5 NDCG@10/% at 128 dimensions and increases to approximately 16.4 at 256 dimensions, 17.9 at 384 dimensions, 18.4 at 512 dimensions, and 18.8 at 768 dimensions. The trend is upward, but the rate of increase slows down as the number of dimensions increases.
* **Search-Adaptor (Orange):** The Search-Adaptor line starts at approximately 19.3 NDCG@10/% at 128 dimensions and increases to approximately 20.5 at 256 dimensions, 21.8 at 384 dimensions, 22.6 at 512 dimensions, and 23.1 at 768 dimensions. The trend is upward.
* **MRL-Adaptor (Green):** The MRL-Adaptor line starts at approximately 22.0 NDCG@10/% at 128 dimensions and increases to approximately 22.8 at 256 dimensions, 23.5 at 384 dimensions, 24.0 at 512 dimensions, and 24.2 at 768 dimensions. The trend is upward, with a slight flattening towards the higher dimensions.
* **SMEC (Yellow):** The SMEC line starts at approximately 22.6 NDCG@10/% at 128 dimensions and increases to approximately 23.8 at 384 dimensions, 24.1 at 512 dimensions, and 24.3 at 768 dimensions. The trend is upward, with a slight flattening towards the higher dimensions.
### Key Observations
* SMEC and MRL-Adaptor consistently outperform PCA and Search-Adaptor across all dimensions.
* PCA has the lowest NDCG@10/% scores across all dimensions.
* The performance of all methods generally improves as the number of dimensions increases, but the rate of improvement decreases at higher dimensions.
### Interpretation
The chart demonstrates the relationship between the number of dimensions and the NDCG@10/% score for different methods. SMEC and MRL-Adaptor appear to be the most effective methods for this particular task, as they consistently achieve higher NDCG@10/% scores compared to PCA and Search-Adaptor. The diminishing returns observed at higher dimensions suggest that there may be an optimal number of dimensions beyond which further increases do not significantly improve performance. PCA's relatively poor performance indicates that it may not be well-suited for this task or requires further optimization.
</details>
(b) Text-to-Image retrieval
<details>
<summary>figures/fig_i2t.png Details</summary>

### Visual Description
## Line Chart: NDCG@10/% vs. Dimensions for Different Methods
### Overview
The image is a line chart comparing the performance of four different methods (PCA, Search-Adaptor, MRL-Adaptor, and SMEC) based on their NDCG@10/% scores across varying dimensions. The chart displays how the NDCG@10/% score changes as the number of dimensions increases from 128 to 768.
### Components/Axes
* **X-axis (Horizontal):** "Dimensions" with tick marks at 128, 256, 384, 512, and 768.
* **Y-axis (Vertical):** "NDCG@10/%" with tick marks at 14, 16, 18, 20, 22, and 24.
* **Legend (Bottom-Right):**
* Blue line with circle markers: PCA
* Orange line with cross markers: Search-Adaptor
* Green line with triangle markers: MRL-Adaptor
* Yellow/Orange line with square markers: SMEC
### Detailed Analysis
* **PCA (Blue):** The blue line, representing PCA, shows a generally upward trend.
* At 128 dimensions, NDCG@10/% is approximately 13.2.
* At 256 dimensions, NDCG@10/% is approximately 15.1.
* At 384 dimensions, NDCG@10/% is approximately 16.3.
* At 512 dimensions, NDCG@10/% is approximately 17.5.
* At 768 dimensions, NDCG@10/% is approximately 18.0.
* **Search-Adaptor (Orange):** The orange line, representing Search-Adaptor, shows a consistent upward trend.
* At 128 dimensions, NDCG@10/% is approximately 18.1.
* At 256 dimensions, NDCG@10/% is approximately 19.2.
* At 384 dimensions, NDCG@10/% is approximately 20.6.
* At 512 dimensions, NDCG@10/% is approximately 21.8.
* At 768 dimensions, NDCG@10/% is approximately 22.6.
* **MRL-Adaptor (Green):** The green line, representing MRL-Adaptor, shows a slight upward trend.
* At 128 dimensions, NDCG@10/% is approximately 21.9.
* At 256 dimensions, NDCG@10/% is approximately 22.1.
* At 384 dimensions, NDCG@10/% is approximately 22.7.
* At 512 dimensions, NDCG@10/% is approximately 23.3.
* At 768 dimensions, NDCG@10/% is approximately 23.8.
* **SMEC (Yellow/Orange):** The yellow/orange line, representing SMEC, shows a slight upward trend.
* At 128 dimensions, NDCG@10/% is approximately 22.5.
* At 256 dimensions, NDCG@10/% is approximately 22.7.
* At 384 dimensions, NDCG@10/% is approximately 23.1.
* At 512 dimensions, NDCG@10/% is approximately 23.4.
* At 768 dimensions, NDCG@10/% is approximately 23.9.
### Key Observations
* PCA consistently performs the worst among the four methods across all dimensions.
* Search-Adaptor shows the most significant improvement in NDCG@10/% as dimensions increase.
* MRL-Adaptor and SMEC perform similarly and generally outperform PCA and Search-Adaptor.
* The performance of MRL-Adaptor and SMEC plateaus as dimensions increase beyond 512.
### Interpretation
The chart illustrates the impact of dimensionality on the performance of different methods, as measured by NDCG@10/%. PCA's lower performance suggests it may not be as effective in capturing relevant information in lower dimensions compared to the other methods. Search-Adaptor benefits significantly from increased dimensionality, indicating it can leverage additional features to improve its ranking performance. MRL-Adaptor and SMEC achieve higher NDCG@10/% scores, suggesting they are more robust or better suited for this particular task. The plateauing of MRL-Adaptor and SMEC suggests that there may be diminishing returns in increasing dimensions beyond a certain point, possibly due to overfitting or the introduction of irrelevant features.
</details>
(c) Image-to-Text retrieval
Figure 8: Experimental results on image and multimodal datasets. (a) presents the results on the Products-10K dataset using an image representation model based on ViT-H (with 1024 dimensions). (b) and (c) show the results on the Fashion-200K dataset for text-to-image and image-to-text retrieval tasks, respectively, using the LLM2CLIP model (with 768 dimensions, based on ViT-L/14 and Llama-3.2-1B).