# SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
**Authors**: Biao Zhang, Lixin Chen, Tong Liu (Taobao & Tmall Group of Alibaba, Hangzhou, China), Bo Zheng (Taobao & Tmall Group of Alibaba, Beijing, China)
## Abstract
Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
## 1 Introduction
<details>
<summary>figures/fig_intr.png Details</summary>

### Visual Description
## Line Chart: NDCG@10 vs. Embedding Dimensions
### Overview
This line chart illustrates the relationship between embedding dimensions and NDCG@10 scores for four different models: LLM2Vec-7B, LLM2Vec-7B with SMEC, LLM2Vec-1B, and LLM2Vec-1B with SMEC. The chart demonstrates how performance (NDCG@10) changes as the dimensionality of the embeddings increases. Two compression ratios are indicated: ~14x and ~12x.
### Components/Axes
* **X-axis:** Embedding dimensions, ranging from 128 to 3584. Markers are placed at 128, 256, 512, 1024, 1536, and 3584.
* **Y-axis:** NDCG@10, ranging from 0.4 to 0.9.
* **Legend:** Located in the bottom-right corner.
* LLM2Vec-7B (Solid Blue Line)
* LLM2Vec-7B(w/ SMEC) (Solid Red Line)
* LLM2Vec-1B (Dashed Blue Line)
* LLM2Vec-1B(w/ SMEC) (Dashed Green Line)
* **Annotations:**
* "~14x lossless dimension compression": positioned above the LLM2Vec-7B and LLM2Vec-7B(w/ SMEC) lines.
* "~12x lossless dimension compression": positioned below the LLM2Vec-1B and LLM2Vec-1B(w/ SMEC) lines.
### Detailed Analysis
Here's a breakdown of the data series and their values:
* **LLM2Vec-7B (Solid Blue Line):** This line slopes upward, indicating increasing NDCG@10 with increasing embedding dimensions.
* 128: 0.492
* 256: 0.576
* 512: 0.648
* 1024: 0.743
* 1536: 0.77
* 3584: 0.803
* **LLM2Vec-7B(w/ SMEC) (Solid Red Line):** This line also slopes upward, consistently outperforming LLM2Vec-7B.
* 128: 0.718
* 256: 0.803
* 512: 0.832
* 1024: 0.844
* 1536: 0.852
* 3584: 0.862
* **LLM2Vec-1B (Dashed Blue Line):** This line generally slopes upward (with a slight dip at 512) and falls below LLM2Vec-7B at higher dimensions.
* 128: 0.568
* 256: 0.648
* 512: 0.635
* 1024: 0.684
* 1536: 0.715
* 3584: 0.757
* **LLM2Vec-1B(w/ SMEC) (Dashed Green Line):** This line slopes upward, and consistently outperforms LLM2Vec-1B.
* 128: 0.707
* 256: 0.73
* 512: 0.77
* 1024: 0.784
* 1536: 0.793
* 3584: 0.802
### Key Observations
* The models with SMEC consistently outperform their counterparts without SMEC across all embedding dimensions.
* LLM2Vec-7B outperforms LLM2Vec-1B at higher dimensions (512 and above), even without SMEC.
* The rate of improvement in NDCG@10 diminishes as embedding dimensions increase, particularly for the models with SMEC.
* The annotations indicate that the models achieve significant lossless dimension compression (approximately 14x for LLM2Vec-7B and 12x for LLM2Vec-1B).
### Interpretation
The data suggests that increasing embedding dimensions generally improves retrieval performance (as measured by NDCG@10). However, the gains from increasing dimensionality become smaller at higher dimensions. The consistent performance improvement from using SMEC indicates that this technique effectively enhances the quality of the embeddings, leading to better retrieval results. The annotations highlight the efficiency of the models, demonstrating that they can achieve good performance with significantly reduced dimensionality, which is beneficial for storage and computational costs. The difference in compression ratios between the 7B and 1B models could be due to differences in model architecture or training data. The flattening of the curves at higher dimensions suggests a point of diminishing returns, where further increasing dimensionality does not yield substantial improvements in performance.
</details>
Figure 1: The effectiveness of SMEC in dimensionality reduction. After customized training with the SMEC method on the BEIR Quora dataset, the embeddings of LLM2Vec-7B (3584 dimensions) and LLM2Vec-1B (1536 dimensions) achieve 14 $\times$ and 12 $\times$ lossless compression, respectively.
<details>
<summary>figures/overview.png Details</summary>

### Visual Description
## Diagram: Model Architectures - Search-Adaptor, Matryoshka-Adaptor, and SMEC
### Overview
The image presents a comparative diagram illustrating the architectures of three different models: Search-Adaptor, Matryoshka-Adaptor, and SMEC (the authors' model). Each model is depicted as a series of processing blocks connected by arrows, representing the flow of data. The diagram highlights the key differences in their structures, particularly in the adaptation layers.
### Components/Axes
The diagram is divided into three main sections, labeled (a) Search-Adaptor, (b) Matryoshka-Adaptor, and (c) SMEC(ours). Each section shows the model's architecture. Common components include:
* **Input:** The starting point of the data flow.
* **Encoder:** A block that processes the input.
* **Fully Connected Layer:** Layers with varying dimensions (MxN, N, Mx4, Mx2, Mx/4) are used.
* **L(X1:N/β), L(X1:N/Γ), L(X1:N/2):** Loss functions or layers.
* **Step 1, Step 2, Step 3:** Sequential steps in the SMEC model.
* **Matryoshka-Adaptor & SMEC(ours):** Include visual representations of nested dolls (Matryoshka) and a grid of smaller images, respectively.
### Detailed Analysis or Content Details
**(a) Search-Adaptor:**
* Input flows into an Encoder, resulting in output 'N'.
* 'N' is fed into four Fully Connected Layers, each with dimensions MxN, N, Mx4, and Mx2, respectively.
* These layers are connected to three loss functions: L(X1:N/β), L(X1:N/Γ), and L(X1:N/2).
* The final output is L(X1:N).
**(b) Matryoshka-Adaptor:**
* Input flows into an Encoder, resulting in output 'N'.
* 'N' is fed into a single Fully Connected Layer with dimensions MxN.
* This layer is connected to three loss functions: L(X1:N/β), L(X1:N/Γ), and L(X1:N/2).
* The final output is L(X).
**(c) SMEC(ours):**
* Input flows into an Encoder, resulting in output 'N'.
* 'N' is fed into a Fully Connected Layer with dimensions MxN.
* **Step 1:** The output is connected to loss function L1(x).
* **Step 2:** The output is fed into a "Sub-Fully Connected Layer" with dimensions MxN/2, connected to loss function L2(x).
* **Step 3:** The output of Step 2 is fed into another "Sub-Fully Connected Layer" with dimensions MxN/4, connected to loss function L3(x).
* The final output is L(X1:N/β), L(X1:N/Γ), and L(X1:N/2).
* A visual representation of Matryoshka dolls is shown, with the largest doll representing the Matryoshka-Adaptor and the SMEC(ours) model represented by a grid of smaller images.
### Key Observations
* The SMEC model introduces a multi-step adaptation process with progressively smaller fully connected layers.
* The Matryoshka-Adaptor uses a single adaptation layer, while the Search-Adaptor uses multiple parallel adaptation layers.
* The SMEC model appears to build upon the Matryoshka-Adaptor concept by adding sequential adaptation steps.
* The visual representations of the Matryoshka dolls and the grid of images in the SMEC model suggest a hierarchical or nested structure.
### Interpretation
The diagram illustrates a progression in model architecture design for adaptation tasks. The Search-Adaptor explores multiple adaptation pathways simultaneously, while the Matryoshka-Adaptor adopts a single, more focused adaptation layer. The SMEC model, presented as the authors' contribution, refines this approach by introducing a sequential, hierarchical adaptation process. The use of smaller fully connected layers in each step of SMEC suggests a gradual refinement of the adaptation process, potentially leading to more precise and efficient adaptation. The visual metaphors of the Matryoshka dolls and the grid of images reinforce the idea of nested structures and hierarchical relationships within the models. The loss functions (L(X1:N/β), L(X1:N/Γ), L(X1:N/2)) likely represent different aspects of the adaptation process, and their presence in each model indicates a common focus on minimizing adaptation error. The diagram doesn't provide specific data or numerical values, but it effectively communicates the structural differences between the three models and their potential implications for adaptation performance. The diagram is a conceptual illustration of model architectures, not a presentation of experimental results.
</details>
Figure 2: Illustration of embedding compression architectures and our proposed approach. (a) presents the direct feature dimensionality reduction performed by the Search-Adaptor using FC layers. (b) illustrates the Matryoshka-Adaptor, which employs a shared set of FC layers to generate low-dimensional embeddings with multiple output dimensions. A Matryoshka-like hierarchical inclusion relationship exists between the high- and low-dimensional embeddings. (c) presents our proposed Sequential Matryoshka Embedding Compression (SMEC) framework, which adopts a sequential approach to progressively reduce high-dimensional embeddings to the target dimension. The animated diagram in the upper-right corner vividly highlights the distinction between Matryoshka-Adaptor and SMEC.
Large language models excel in diverse text tasks due to their ability to capture nuanced linguistic structures and contextual dependencies. For instance, GPT-4 achieves state-of-the-art performance on benchmarks like GLUE Wang et al. (2018) and SuperGLUE Wang et al. (2019), demonstrating its proficiency in tasks such as natural language inference (NLI), question answering (QA), and text classification. This success is attributed to their transformer-based architectures Vaswani et al. (2017), which enable parallel processing of sequential data and capture long-range dependencies through self-attention mechanisms. Similarly, Llama-3 Grattafiori et al. (2024) and ChatGPT Brown et al. (2020) leverage similar principles to achieve comparable or superior performance in domain-specific and multi-lingual tasks.
LLMs are increasingly integrated into commercial information retrieval (IR) systems, such as search engines (e.g., Google's MUM) and recommendation platforms (e.g., Netflix's content retrieval). Their ability to generate embeddings for long documents (e.g., books, research papers) and dynamic queries (e.g., conversational search) makes them indispensable for modern applications. For example, the BEIR benchmark Thakur et al. (2021) evaluates cross-domain retrieval performance, where LLMs outperform traditional BM25 Robertson and Walker (1994) and BERT-based models Devlin et al. (2019) by leveraging contextual embeddings.
While LLMs' high-dimensional embeddings enable sophisticated semantic modeling, their storage and computational costs hinder scalability. Embedding dimensions of LLMs typically range from 1,024 (e.g., GPT-3) to 4,096 (e.g., Llama-3), exacerbating storage overhead and computational inefficiency, especially in real-time systems requiring dynamic updates. Moreover, high-dimensional vectors degrade the performance of retrieval algorithms due to the curse of dimensionality Beyer et al. (1999). For example, exact nearest-neighbor search in high-dimensional spaces becomes computationally infeasible, necessitating approximate methods like FAISS Johnson et al. (2017) or HNSW Yury et al. (2018). Even with such optimizations, query latency grows rapidly with dimensionality, limiting responsiveness in real-world applications.
To address these challenges, Matryoshka Representation Learning (MRL) Kusupati et al. (2022) encodes multi-scale information into a single embedding, balancing task complexity and efficiency. It achieves strong results in large-scale classification and retrieval tasks and has inspired variants like Matryoshka-Adaptor Yoon et al. (2024), which offers a scalable framework for transforming embeddings into structured representations with Matryoshka properties under both supervised and unsupervised settings. However, MRL's multi-scale parallel training strategy limits its practical application in industry: whenever the retrieval system requires a new low-dimensional embedding, retraining from scratch is necessary to achieve effective dimensionality reduction.
In this paper, we systematically analyze the limitations of MRL and its variants in embedding compression and propose three key enhancements: (1) a continued-training-friendly training framework named Sequential Matryoshka Representation Learning (SMRL); (2) an adaptive dimension selection (ADS) mechanism to minimize information degradation during dimension pruning; and (3) a Selectable Cross-batch Memory (S-XBM) strategy to enhance unsupervised learning between high- and low-dimensional embeddings.
## 2 Related Work
### 2.1 Matryoshka representation learning
Matryoshka representation learning introduces a novel paradigm where embeddings are pretrained to inherently support progressive dimension truncation. This enables fine-grained control over the trade-off between computational latency (via reduced dimensionality) and accuracy (via retained semantic structure). Key innovations include the design of Matryoshka properties, such as hierarchical information encoding and intra-cluster compactness, which ensure that even truncated embeddings retain utility for downstream tasks.
Beyond representation learning, the concept of MRL has been applied to image generation, such as Matryoshka Diffusion Models (MDM) Gu et al. (2023); multimodal content understanding, such as $M^{3}$ Cai et al. (2024); and Multimodal Large Language Models (MLLMs), such as the Matryoshka Query Transformer (MQT) Hu et al. (2024).
### 2.2 Embedding Compression
Embedding compression aims to reduce the computational and memory footprint of neural network models or embeddings while preserving their utility for downstream tasks. This objective has driven research across multiple paradigms, each addressing different trade-offs between compression efficiency, performance retention, and adaptability. Early approaches primarily focused on unsupervised techniques based on linear algebra, such as Principal Component Analysis (PCA) Jolliffe and Cadima (2016), Linear Discriminant Analysis (LDA) Mclachlan, and Non-negative Matrix Factorization (NMF) Lee and Seung (2000). Building upon these, autoencoders and their variants, such as Variational Autoencoders (VAEs) Kingma et al. (2013), have gradually emerged as powerful tools for nonlinear dimensionality reduction, capable of capturing complex data distributions. With the development of deep learning, methods such as Contrastive Predictive Coding (CPC) Oord et al. (2018) and Momentum Contrast (MoCo) He et al. (2020) are capable of learning robust and compact representations from unlabeled data.
Recently, customized methods such as Search-Adaptor Yoon et al. (2023) and Matryoshka-Adaptor Yoon et al. (2024) have emerged as a new trend in embedding compression. They achieve significant dimensionality reduction by adding only a small number of parameters to the original representation model and retraining it on specific data.
## 3 Method
### 3.1 Rethinking MRL for embedding compression
MRL employs a nested-dimensional architecture to train models that learn hierarchical feature representations across multiple granularities. This allows adaptive deployment of models based on computational constraints. Specifically, MRL defines a series of models $f_{1},f_{2},\ldots,f_{M}$ that share identical input and output spaces but progressively expand their hidden dimensions.
The term Matryoshka derives from the hierarchical parameter structure where the parameters of model $f_{m}$ are nested within those of its successor $f_{m+1}$ . To illustrate, consider an FC layer within the largest model $f_{M}$ , which contains $d_{M}$ neurons in its hidden layer. Correspondingly, the FC layer of $f_{m}$ retains the first $d_{m}$ neurons of this structure, with dimensions satisfying $d_{1}\leq d_{2}\leq\dots\leq d_{M}$ . MRL jointly trains these models using the following objective:
$$
\sum_{m=1}^{M}c_{m}\cdot\mathcal{L}(f_{m}(\mathbf{x});y), \tag{1}
$$
where $\mathcal{L}$ denotes the loss function, $y$ represents the ground-truth label, and $c_{m}$ are task-specific weighting coefficients. Notably, each training iteration requires forward and backward propagation for all $M$ models, resulting in substantial computational overhead compared to training a single standalone model. Upon convergence, MRL enables flexible inference by selecting any intermediate dimension $d_{i}\leq d_{M}$ , thereby accommodating diverse computational constraints.
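To make the joint objective in Equation 1 concrete, here is a minimal plain-Python sketch; the cosine-based per-sample loss and all names are illustrative assumptions, not the paper's implementation. Each nested prefix of the same embedding is scored separately and the weighted losses are summed, which is why one training iteration pays for all $M$ granularities:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mrl_joint_loss(x, y, dims=(64, 128, 256), weights=(1.0, 1.0, 1.0)):
    """Eq. (1): weighted sum of the loss evaluated on each nested prefix of x and y."""
    return sum(c * cosine_distance(x[:d], y[:d]) for d, c in zip(dims, weights))
```

In a real retrieval system the per-dimension loss would be a ranking or contrastive loss over batches of embeddings, but the nesting structure is the same.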
Although the MRL method has partially mitigated the performance degradation of representations during dimensionality reduction, we contend that it still faces the following three unresolved issues:
Gradient Fluctuation. In large-scale vector retrieval systems, sample similarity is measured by the distance between their representation vectors. Consequently, the optimization of embedding models typically employs loss functions based on embedding similarity. Under this condition, according to the derivation in Appendix A, the loss function $\mathcal{L}^{d}$ of MRL under dimension $d$ satisfies the following relationship with respect to the parameter $\mathbf{w}_{i}$ in the $i$ -th dimension of the FC layer:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}\propto\frac{1}{\delta(d)^{2}}. \tag{2}
$$
Here, $\delta(d)$ is a complex function that is positively correlated with the dimension $d$ . This equation provides a mathematical foundation for analyzing gradient fluctuations in a multi-dimensional joint optimization architecture. It indicates that during MRL training, loss functions from different dimensions produce gradients of varying magnitudes on the same model parameter, thereby increasing gradient variance. In Section 5.2, we empirically demonstrate that this conclusion holds for different loss functions. We propose a solution to this problem in Section 3.2.
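The dimension dependence in Equation 2 can be checked numerically. The sketch below (plain Python; it assumes a cosine-similarity loss and random Gaussian embeddings, so that $\delta(d)$ behaves like the embedding norm, roughly $\sqrt{d}$) computes the analytic gradient of cosine similarity with respect to one input vector and shows that its magnitude shrinks as the dimension grows:

```python
import math, random

def cosine_grad_norm(d, seed=0):
    """Norm of the analytic gradient of cos(u, v) w.r.t. u,
    for random Gaussian u, v of dimension d."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(d)]
    v = [rng.gauss(0, 1) for _ in range(d)]
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    # d cos / d u_i = v_i / (|u||v|) - cos * u_i / |u|^2
    grad = [v_i / (nu * nv) - cos * u_i / (nu * nu) for u_i, v_i in zip(u, v)]
    return math.sqrt(sum(g * g for g in grad))
```

Under this assumption, gradients flowing from a low-dimensional loss are systematically larger than those from a high-dimensional loss acting on the same parameter, which is the source of the variance discussed above.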
Information Degradation. Neural network parameters exhibit heterogeneous contributions to model performance, as demonstrated by the non-uniform distribution of their gradients and feature importance metrics Frankle and Carbin (2018). The MRL method employs a dimension truncation strategy (e.g., $D\rightarrow D/2\rightarrow D/4\ldots$ ) to prune parameters and reduce feature dimensions by retaining partial parameters. However, this approach fails to adequately preserve critical parameters because it relies on a rigid, static truncation rule. Although MRL employs joint training of high- and low-dimensional vectors to redistribute information between truncated and retained parameters, this process is unavoidably accompanied by information degradation. Specifically, discarded parameters may contain essential information, such as unique feature mappings or high-order dependencies, that cannot be effectively recovered by the remaining ones. Empirical evidence, such as accuracy degradation and increased generalization gaps, demonstrates that such loss leads to suboptimal model performance and slower convergence Li et al. (2023). In summary, while MRL enables hierarchical dimensionality reduction, its inability to selectively retain critical parameters and the inherent information degradation during post-truncation training ultimately undermine its effectiveness in maintaining model performance. In Section 3.3, we propose a more effective dimension pruning method.
<details>
<summary>figures/fig_ads.png Details</summary>

### Visual Description
## Diagram: Dimension Pruning with Gumbel-Softmax
### Overview
This diagram illustrates a process of dimension pruning using a Gumbel-Softmax layer, followed by a Rank Loss calculation. The diagram shows two parallel processing paths: one with the full dimension (D dim) and another with a reduced dimension (D/2 dim) after pruning. The process involves a learnable parameter and top-k selection via Gumbel-Softmax, leading to dimension pruning and subsequent Rank Loss computation.
### Components/Axes
The diagram consists of the following components:
* **Learnable Parameter + topk select:** Represented by a purple rectangle.
* **Gumbel Softmax:** Indicated by a light blue arrow.
* **Dimension Pruning:** Represented by a large grey arrow.
* **Input Feature Matrix:** Represented by a green rectangle with labeled rows (X1 to Xi).
* **Rank Loss:** Represented by an orange rectangle.
* **Dimension Labels:** D dim and D/2 dim, indicating the dimensionality of the feature vectors.
* **Index Selection:** [1, 3, …, Xi-2, Xi-1]
### Detailed Analysis
The diagram shows two parallel paths.
**Top Path (Full Dimension):**
1. A "Learnable Parameter + topk select" (purple rectangle) feeds into a "Gumbel Softmax" layer (light blue arrow).
2. The output of the Gumbel Softmax is applied to an input feature matrix (green rectangle) with 'i' rows labeled X1 to Xi.
3. The feature matrix is then passed to a Rank Loss calculation (orange rectangle).
4. The output dimension is labeled as "D dim".
**Bottom Path (Pruned Dimension):**
1. The "Learnable Parameter + topk select" (purple rectangle) also feeds into a "Gumbel Softmax" layer (light blue arrow).
2. The output of the Gumbel Softmax is used for "Dimension Pruning" (grey arrow), reducing the input feature matrix.
3. The pruned feature matrix (green rectangle) with 'i-1' rows labeled X1 to Xi-1 is then passed to a Rank Loss calculation (orange rectangle).
4. The output dimension is labeled as "D/2 dim".
The index selection [1, 3, …, Xi-2, Xi-1] indicates that the pruning process selects specific dimensions (odd-numbered in this case) from the original feature matrix.
### Key Observations
* The diagram illustrates a method for reducing the dimensionality of feature vectors.
* The Gumbel-Softmax layer appears to be used for differentiable selection of dimensions.
* The Rank Loss is calculated on both the full-dimensional and pruned feature vectors.
* The pruning process appears to select a subset of the original dimensions, resulting in a reduced dimensionality.
### Interpretation
The diagram depicts a technique for learning which dimensions of a feature vector are most important. The Gumbel-Softmax layer provides a differentiable way to select a subset of dimensions, and the Rank Loss function encourages the model to learn a representation where the selected dimensions are more informative. The two parallel paths allow for comparison between the full-dimensional and pruned representations, potentially improving the efficiency and performance of the model. The selection of odd-numbered dimensions [1, 3, …, Xi-2, Xi-1] suggests a specific pruning strategy, but the diagram doesn't provide information on why this strategy is chosen. The use of Rank Loss implies that the goal is to learn a ranking of the dimensions based on their importance. This could be useful for feature selection, dimensionality reduction, or model compression.
</details>
Figure 3: The ADS module introduces a set of learnable parameters to dynamically select dimensions based on their importance during the dimensionality reduction process.
Sample Selection. The MRL framework employs supervised learning to jointly train high-dimensional ( $D$ ) and low-dimensional ( $D^{\prime}$ ) features. However, the number of available samples is limited by manual annotation. Matryoshka-Adaptor introduces in-batch sample mining strategies to expand the training sample scale, thereby addressing the inherent limitation. Specifically, it generates cross-sample pairs via the cartesian product of batch samples:
$$
\mathcal{P}=\{(x_{i},x_{j})\mid x_{i},x_{j}\in\text{Batch},\ i\neq j\}. \tag{3}
$$
This approach creates $B(B-1)$ pairs per batch (where $B$ denotes the batch size), enabling cross-sample comparisons within large batches. However, this indiscriminate pairing introduces noise from non-representative or irrelevant sample pairs.
In light of this limitation, the method employs Top- $k$ similarity-based selection:
$$
\begin{split}\mathcal{P}_{\text{top-}k}&=\text{Top}_{k}\left(\text{similarity}(x_{i},x_{j})\right),\\
&\quad\forall\ (x_{i},x_{j})\in\mathcal{P}.\end{split} \tag{4}
$$
Here, only the top- $k$ most similar pairs are retained for training, reducing computational overhead while focusing on informative interactions. Despite this improvement, the diversity of effective samples remains fundamentally constrained by the original batch size $B$ . In Section 3.4, we develop a strategy that empowers the model to mine global sample beyond the current batch.
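Concretely, Equations 3 and 4 amount to enumerating the $B(B-1)$ ordered pairs of a batch and keeping the $k$ highest-scoring ones. A minimal sketch in plain Python, with dot-product similarity and illustrative names:

```python
from itertools import permutations

def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def topk_pairs(batch, k):
    """Form all B*(B-1) ordered cross-sample pairs (Eq. 3),
    then keep the k most similar ones (Eq. 4)."""
    pairs = list(permutations(range(len(batch)), 2))
    scored = sorted(pairs, key=lambda ij: dot(batch[ij[0]], batch[ij[1]]), reverse=True)
    return scored[:k]
```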
### 3.2 Sequential Matryoshka Representation Learning
Applying the conclusions from Section 3.1 to the MRL training process, we take the parallel dimensionality reduction process $[D,D/2,D/4]$ as an example. The ratio of the average gradients for parameters $\mathbf{w}_{i}\,(i\in[0,D/4])$ and $\mathbf{w}_{j}\,(j\in[D/4,D/2])$ is as follows:
$$
\overline{\text{grad}_{i}}:\overline{\text{grad}_{j}}=\left(\frac{\partial\mathcal{L}^{D}}{\partial\mathbf{w}_{i}}+\frac{\partial\mathcal{L}^{D/2}}{\partial\mathbf{w}_{i}}+\frac{\partial\mathcal{L}^{D/4}}{\partial\mathbf{w}_{i}}\right):\left(\frac{\partial\mathcal{L}^{D}}{\partial\mathbf{w}_{j}}+\frac{\partial\mathcal{L}^{D/2}}{\partial\mathbf{w}_{j}}\right)\approx 1+\frac{\delta(D/2)^{2}}{\delta(D/4)^{2}}. \tag{5}
$$
As shown in Equation 5, the average gradient magnitude of parameter $\mathbf{w}_{i}$ is approximately $1+\frac{\delta(D/2)^{2}}{\delta(D/4)^{2}}$ times that of parameter $\mathbf{w}_{j}$ , primarily due to the influence of the lower-dimensional loss function $\mathcal{L}^{D/4}$ . To resolve this issue, we propose Sequential Matryoshka Representation Learning (SMRL), which substitutes the original parallel compression of embeddings with a sequential approach, as illustrated in Figure 2. Assume a dimensionality reduction trajectory of $[D,D/2,D/4,\dots,D/2^{n}]$ . In each iteration, only the immediate transition (e.g., $D/2^{n-1}\rightarrow D/2^{n}$ ) is trained, avoiding the inclusion of lower-dimensional losses that amplify gradients for low-dimensional parameters. By eliminating this factor, the gradients of $\mathbf{w}_{i}(i\in[0,D/2^{n}])$ follow a consistent distribution with reduced variance, improving convergence speed and performance. Once the loss converges in the current iteration, the dimensionality reduction $D/2^{n-1}\rightarrow D/2^{n}$ is complete, and the process proceeds to the next stage $D/2^{n}\rightarrow D/2^{n+1}$ , repeating until the target dimension is reached. Additionally, after convergence in one iteration, the optimal parameters for the current dimension are frozen to prevent subsequent reductions from degrading their performance. Notably, compared to MRL, the SMRL framework is more amenable to continued training. In scenarios where lower-dimensional retrieval embeddings (e.g., $D/8$) or intermediate embeddings (e.g., $D/3$) are required, these can be obtained through further dimensionality-reduction training starting from the already preserved $D/4$ or $D/2$ parameters, eliminating the need to retrain from scratch as is typically required in MRL.
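The stage-wise control flow of SMRL can be sketched as a simple loop. Here `train_stage` is a hypothetical caller-supplied helper that trains one $D/2^{n-1}\rightarrow D/2^{n}$ projection to convergence and returns its frozen parameters; this illustrates the control flow only, not the paper's code:

```python
def train_smrl(encoder_dim, target_dim, train_stage):
    """Sequential Matryoshka sketch: one halving stage at a time.
    Each stage is trained to convergence, then frozen before the next begins."""
    frozen_stages = []
    d = encoder_dim
    while d > target_dim:
        d_next = d // 2
        frozen_stages.append(train_stage(d, d_next))  # freeze after convergence
        d = d_next
    return frozen_stages
```

Continued training falls out naturally: to reach a new, lower dimension later, the loop simply resumes from the last frozen stage instead of restarting from the encoder output.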
### 3.3 Adaptive Dimension Selection Module
Since directly truncating dimensions to obtain low-dimensional representations in MRL inevitably leads to information degradation, we propose the Adaptive Dimension Selection (ADS) module to dynamically identify important dimensions during training. As illustrated in Figure 3, we introduce a set of learnable parameters that represent the importance of different dimensions in the original representation $\mathbf{Z}(\text{dim}=D)$ , and use these parameters to perform dimensional sampling, obtaining a reduced-dimension representation $\mathbf{Z}^{\prime}(\text{dim}=D/2)$ . Since the sampling operation is non-differentiable, during the training phase, we utilize the Gumbel-Softmax Jang et al. (2016) to approximate the importance of different dimensions. This is achieved by adding Gumbel-distributed noise $G\sim\text{Gumbel}(0,1)$ to the logits parameters $\hat{\mathbf{z}}$ for each dimension, followed by applying the softmax function to the perturbed logits to approximate the one-hot vector representing dimension selection. Mathematically, this can be expressed as:
$$
\mathbf{z}=\text{softmax}_{\tau}(\hat{\mathbf{z}}+G). \tag{6}
$$
Importantly, the Gumbel approximation allows the softmax scores of dimension importance to be interpreted as the probability of selecting each dimension, rather than enforcing a deterministic selection of the top- $k$ dimensions. This achieves a fully differentiable reparameterization, transforming the selection of embedding dimensions into an optimizable process.
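Equation 6 can be sketched in a few lines of plain Python; the function names are illustrative, and a real implementation would operate on tensors and typically anneal the temperature $\tau$ during training:

```python
import math, random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Softmax over Gumbel-perturbed logits (Eq. 6): soft, differentiable
    selection scores rather than a hard top-k choice."""
    # Gumbel(0, 1) noise: g = -log(-log(U)), U ~ Uniform(0, 1)
    noisy = [z - math.log(-math.log(rng.random())) for z in logits]
    m = max(noisy)
    exps = [math.exp((z - m) / tau) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def select_dims(logits, keep, tau=1.0, rng=random):
    """Rank dimensions by sampled selection probability and keep the top `keep`."""
    p = gumbel_softmax(logits, tau, rng)
    return sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:keep]
```

As $\tau \rightarrow 0$ the output approaches a one-hot vector, while larger $\tau$ keeps the scores soft and the gradients well-behaved.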
<details>
<summary>figures/fig3.png Details</summary>

### Visual Description
## Diagram: System Architecture for Contrastive Learning
### Overview
This diagram illustrates the architecture of a system employing contrastive learning, likely for a language or image model. The system utilizes a frozen model, an XBM (Cross-Batch Memory), a score matrix, and a top-k sampling mechanism, culminating in a fully connected (FC) layer and a pair-based loss function. The diagram depicts a data flow from left to right, with feedback loops involving the XBM.
### Components/Axes
The diagram consists of the following components:
* **Frozen model:** A pre-trained model whose weights are not updated during training.
* **XBM (Cross-Batch Memory):** A memory bank used to store and retrieve embeddings.
* **Score matrix:** A matrix representing the similarity scores between embeddings.
* **Top-k sample:** A sampling method that selects the top k most similar embeddings.
* **FC Layer:** A fully connected layer.
* **Pair-based loss:** A loss function that encourages similar embeddings to be close and dissimilar embeddings to be far apart.
* **dequeue:** An operation to remove elements from a queue.
* **enqueue:** An operation to add elements to a queue.
* Arrows indicate the direction of data flow.
* Dashed boxes represent processing blocks or memory structures.
### Detailed Analysis or Content Details
1. **Frozen Model:** Located on the bottom-left, the "Frozen model" is a gray square. An arrow indicates data flows *from* this model.
2. **Enqueue to XBM:** The output of the Frozen model is fed into a dashed box containing multiple rectangular blocks, representing embeddings. An arrow labeled "enqueue" points from the Frozen model output to this box.
3. **XBM:** The dashed box containing embeddings is labeled "XBM". An arrow labeled "dequeue" points *from* this box to a multiplication symbol (ā).
4. **Score Matrix:** The output of the multiplication operation is fed into a grid-like structure labeled "Score matrix". This matrix appears to represent similarity scores between embeddings.
5. **Top-k Sample:** An arrow labeled "Top-k sample" points from the Score matrix to another dashed box containing embeddings.
6. **FC Layer:** The output of the Top-k sample is fed into a rectangular block labeled "FC Layer".
7. **Pair-based Loss:** The output of the FC Layer is fed into a rectangular block labeled "Pair-based loss". This represents the final loss calculation.
### Key Observations
* The XBM acts as a memory bank, storing embeddings from previous batches.
* The "dequeue" and "enqueue" operations suggest a queue-like structure for the XBM, allowing for continuous updates and retrieval of embeddings.
* The multiplication symbol (⊗) likely represents a dot product or other similarity calculation between embeddings.
* The system appears to be designed to learn representations that capture similarity between data points.
### Interpretation
This diagram depicts a contrastive learning framework. The frozen model generates embeddings, which are stored in the XBM. The XBM allows the model to compare embeddings from different batches, enabling it to learn more robust and generalizable representations. The score matrix quantifies the similarity between embeddings, and the top-k sampling mechanism focuses on the most relevant comparisons. The FC layer and pair-based loss function then optimize the embeddings to maximize similarity between positive pairs and minimize similarity between negative pairs. The use of a frozen model suggests that the system is focused on learning representations *without* modifying the underlying model weights, potentially for transfer learning or few-shot learning scenarios. The XBM is a key component, enabling the model to leverage information from past batches to improve its learning process. The entire architecture is designed to learn a meaningful embedding space where similar data points are close together and dissimilar data points are far apart.
</details>
Figure 4: S-XBM maintains a queue during training to store historical features across batches. Rather than incorporating all stored features into the current batch, it selectively leverages hard samples that exhibit high similarity to the current batch samples.
### 3.4 Selectable Cross-Batch Memory
A natural teacher-student relationship inherently exists between the original embedding and its reduced-dimensional counterpart, making it feasible to improve the compressed embedding through unsupervised learning Yoon et al. (2024). However, as discussed in Section 3.1, performing this process within a single batch suffers from sample noise and insufficient diversity. As illustrated in Figure 4, we propose the Selectable Cross-Batch Memory (S-XBM) module, which addresses this limitation by maintaining a first-in-first-out (FIFO) queue during training to store original embeddings across batches. Unlike the original XBM Wang et al. (2020), we introduce two task-specific improvements: (1) retrieving only the top-$k$ most similar samples from the memory bank to construct new batches, and (2) deferring the trainable FC layer and storing only features generated by the frozen backbone, thereby avoiding feature drift. The unsupervised loss between the original embedding $emb$ and the low-dimensional embedding $emb[:d]$ is as follows:
$$
\mathcal{L}_{un-sup}=\sum_{i}\sum_{j\in\mathcal{N}_{K}(i)}\left|\text{Sim}(emb_{i},emb_{j})-\text{Sim}(emb_{i}[:d],emb_{j}[:d])\right| \tag{7}
$$
where $\mathcal{N}_{K}(i)$ denotes the set of the top $k$ most similar embeddings to $emb_{i}$ within the S-XBM module.
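The S-XBM queue and the loss in Eq. (7) can be sketched in a few lines of NumPy. Cosine similarity stands in for $\text{Sim}(\cdot,\cdot)$, and the class name, queue capacity, and array interface are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from collections import deque


class SXBM:
    """Selectable Cross-Batch Memory: a FIFO queue of frozen-backbone embeddings."""

    def __init__(self, capacity, k):
        # deque(maxlen=...) enqueues new features and dequeues the oldest automatically
        self.queue = deque(maxlen=capacity)
        self.k = k

    def enqueue(self, batch):
        for emb in batch:
            self.queue.append(emb)

    def topk_neighbors(self, query):
        """Return the k stored embeddings most similar (cosine) to `query`."""
        bank = np.stack(self.queue)
        bank_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        q_n = query / np.linalg.norm(query)
        sims = bank_n @ q_n                      # score row against the memory bank
        idx = np.argsort(-sims)[: self.k]        # top-k sample
        return bank[idx]


def unsup_loss(batch, memory, d):
    """Eq. (7): match full-dim and truncated-dim similarities over top-k neighbors."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    loss = 0.0
    for emb_i in batch:
        for emb_j in memory.topk_neighbors(emb_i):
            loss += abs(cos(emb_i, emb_j) - cos(emb_i[:d], emb_j[:d]))
    return loss
```

During training, the queue would be filled with frozen-backbone embeddings from past batches, so the top-$k$ retrieval supplies hard cross-batch neighbors beyond what a single batch contains.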
<details>
<summary>figures/fig_openai.png Details</summary>

### Visual Description
## Line Chart: NDCG@10% vs. Dimensions
### Overview
This line chart visualizes the relationship between the dimensions of a model and its NDCG@10% performance. Four different model configurations are compared: Original(MRL), search-adaptor, MRL-Adaptor, and SMEC. The chart shows how the NDCG@10% metric changes as the number of dimensions increases from 128 to 3072.
### Components/Axes
* **X-axis:** Dimensions, with markers at 128, 256, 512, 768, 1536, and 3072.
* **Y-axis:** NDCG@10%, ranging from approximately 50% to 62%.
* **Legend:** Located in the bottom-right corner, identifying the four data series:
* Original(MRL) - Blue line with circular markers.
* search-adaptor - Orange line with cross markers.
* MRL-Adaptor - Green line with triangular markers.
* SMEC - Yellow line with square markers.
* **Gridlines:** A light gray grid is present to aid in reading values.
### Detailed Analysis
Here's a breakdown of each data series, with approximate values read from the chart:
* **Original(MRL) (Blue):** This line shows a generally upward trend, but with diminishing returns.
* 128 Dimensions: ~51.5%
* 256 Dimensions: ~54.5%
* 512 Dimensions: ~55.5%
* 768 Dimensions: ~56.0%
* 1536 Dimensions: ~56.0%
* 3072 Dimensions: ~56.0%
* **search-adaptor (Orange):** This line exhibits a strong initial increase, then plateaus.
* 128 Dimensions: ~52.0%
* 256 Dimensions: ~59.0%
* 512 Dimensions: ~60.5%
* 768 Dimensions: ~60.5%
* 1536 Dimensions: ~61.0%
* 3072 Dimensions: ~61.0%
* **MRL-Adaptor (Green):** This line shows a consistent upward trend, reaching the highest values.
* 128 Dimensions: ~54.5%
* 256 Dimensions: ~59.0%
* 512 Dimensions: ~59.5%
* 768 Dimensions: ~60.5%
* 1536 Dimensions: ~61.5%
* 3072 Dimensions: ~61.5%
* **SMEC (Yellow):** This line demonstrates a rapid increase initially, then levels off, similar to search-adaptor.
* 128 Dimensions: ~56.0%
* 256 Dimensions: ~59.0%
* 512 Dimensions: ~59.5%
* 768 Dimensions: ~60.5%
* 1536 Dimensions: ~61.0%
* 3072 Dimensions: ~61.0%
### Key Observations
* The MRL-Adaptor consistently outperforms the other models across all dimensions.
* The Original(MRL) model shows the lowest performance and the smallest improvement with increasing dimensions.
* The search-adaptor and SMEC models exhibit similar performance, with a significant initial gain that plateaus at higher dimensions.
* All models show diminishing returns in NDCG@10% as the number of dimensions increases beyond 768.
### Interpretation
The data suggests that increasing the dimensionality of the model generally improves performance, as measured by NDCG@10%, up to a certain point. However, the benefit of adding more dimensions diminishes as the dimensionality increases. The MRL-Adaptor configuration appears to be the most effective, consistently achieving the highest NDCG@10% scores. The Original(MRL) model, without adaptation, is the least effective.
The plateauing effect observed in the search-adaptor and SMEC models suggests that the models may be reaching a point of saturation, where adding more dimensions does not provide significant additional information or improve the model's ability to discriminate between relevant and irrelevant results. The differences between the models likely stem from the specific adaptation techniques employed (search-adaptor, MRL-Adaptor, SMEC) and their ability to leverage the increased dimensionality effectively. The initial steep increase in performance for search-adaptor and SMEC could be due to the model learning more robust features with the initial increase in dimensions, but the subsequent plateau indicates that the added complexity doesn't translate into further gains.
</details>
(a) OpenAI text embeddings
<details>
<summary>figures/fig_llm2vec.png Details</summary>

### Visual Description
## Line Chart: NDCG@10% vs. Dimensions
### Overview
This image presents a line chart comparing the performance of four methods: Original(PCA), Search-Adaptor, MRL-Adaptor, and SMEC, across varying dimensions. The performance metric is NDCG@10% (normalized discounted cumulative gain at rank 10, in percent). The chart illustrates how the performance of each method changes as the dimensionality of the data increases.
### Components/Axes
* **X-axis:** Dimensions, ranging from 128 to 3584. The axis markers are: 128, 256, 512, 768, 1536, 3584.
* **Y-axis:** NDCG@10%, ranging from 30 to 60. The axis markers are: 30, 35, 40, 45, 50, 55, 60.
* **Legend:** Located in the top-right corner of the chart.
* Original(PCA) - Blue circle
* Search-Adaptor - Orange cross
* MRL-Adaptor - Green triangle
* SMEC - Yellow square
### Detailed Analysis
* **Original(PCA):** This line (blue) starts at approximately 31% at 128 dimensions and steadily increases to approximately 54% at 3584 dimensions. The line exhibits a consistent upward slope, indicating improved performance with increasing dimensions.
* 128 Dimensions: ~31%
* 256 Dimensions: ~37%
* 512 Dimensions: ~46%
* 768 Dimensions: ~50%
* 1536 Dimensions: ~53%
* 3584 Dimensions: ~54%
* **Search-Adaptor:** This line (orange) begins at approximately 50% at 128 dimensions and rises to approximately 61% at 3584 dimensions. The slope is initially steeper, then flattens out.
* 128 Dimensions: ~50%
* 256 Dimensions: ~56%
* 512 Dimensions: ~59%
* 768 Dimensions: ~60%
* 1536 Dimensions: ~60%
* 3584 Dimensions: ~61%
* **MRL-Adaptor:** This line (green) starts at approximately 55% at 128 dimensions and increases to approximately 60% at 3584 dimensions. The line is relatively flat, indicating minimal performance improvement with increasing dimensions.
* 128 Dimensions: ~55%
* 256 Dimensions: ~57%
* 512 Dimensions: ~58%
* 768 Dimensions: ~59%
* 1536 Dimensions: ~60%
* 3584 Dimensions: ~60%
* **SMEC:** This line (yellow) begins at approximately 55% at 128 dimensions and rises to approximately 61% at 3584 dimensions. The line is relatively flat, similar to MRL-Adaptor, but slightly higher in overall performance.
* 128 Dimensions: ~55%
* 256 Dimensions: ~57%
* 512 Dimensions: ~59%
* 768 Dimensions: ~60%
* 1536 Dimensions: ~60%
* 3584 Dimensions: ~61%
### Key Observations
* The Original(PCA) method shows the most significant improvement in NDCG@10% as the number of dimensions increases.
* Search-Adaptor and SMEC achieve the highest NDCG@10% scores, consistently above 60% at higher dimensions.
* MRL-Adaptor exhibits the least improvement with increasing dimensions, plateauing around 60%.
* The performance gap between the methods narrows at higher dimensions.
### Interpretation
The data suggests that increasing the dimensionality of the data generally improves the performance of these methods, as measured by NDCG@10%. However, the rate of improvement varies significantly. The Original(PCA) method benefits the most from increased dimensionality, while the Search-Adaptor, MRL-Adaptor, and SMEC methods show diminishing returns.
The fact that Search-Adaptor and SMEC consistently outperform the other methods, particularly at higher dimensions, suggests that these methods are more effective at leveraging the information contained in higher-dimensional data. The plateauing of MRL-Adaptor indicates that it may be limited by its ability to effectively utilize the increased dimensionality.
The Original(PCA) method's consistent improvement suggests that it is a good baseline approach, but it may not be able to achieve the same level of performance as the more sophisticated adaptation methods. The convergence of the lines at higher dimensions suggests that there may be a point of diminishing returns, where further increasing the dimensionality does not lead to significant performance gains. This could be due to factors such as the curse of dimensionality or the limitations of the evaluation metric.
</details>
(b) LLM2Vec
Figure 5: Experimental results on the BEIR dataset comparing two models: OpenAI's text-embedding-3-large (with 3072 dimensions) and LLM2Vec (with 3584 dimensions), the latter built upon the Qwen2-7B model. OpenAI text embeddings inherently contain multi-scale representations (enabled by MRL during pretraining), while LLM2Vec obtains its original low-dimensional representations via PCA.
## 4 Experiments
In this section, we compare our approach with state-of-the-art methods in the field of embedding dimensionality reduction.
### 4.1 Dataset Description
We evaluate the model's retrieval performance across diverse datasets: BEIR Thakur et al. (2021) (text retrieval), Products-10K Bai et al. (2020) (image retrieval), and Fashion-200K Han et al. (2017) (cross-modal retrieval). BEIR is a comprehensive text retrieval benchmark consisting of 13 selected datasets from diverse domains. Products-10K contains approximately 10,000 products with over 150,000 images for large-scale product image retrieval. Fashion-200K includes over 200,000 fashion items with paired image-text data for cross-modal tasks.
### 4.2 Implementation Details
We use state-of-the-art models to extract the original embeddings for different datasets. Specifically, the BEIR dataset employs OpenAI text embeddings ope and LLM2Vec BehnamGhader et al. (2024) for text representation; the Products-10K dataset utilizes LLM2CLIP Huang et al. (2024) to obtain cross-modal embeddings; and the Fashion-200K dataset extracts image embeddings using the ViT-H Dosovitskiy et al. (2020) model. All dimensionality reduction methods are performed based on these original representations. To align with other methods, SMEC also adopts rank loss Yoon et al. (2023) as the supervised loss function, which is defined as follows:
$$
\mathcal{L}_{rank}=\sum_{i}\sum_{j}\sum_{k}\sum_{m}I(y_{ij}>y_{ik})\,(y_{ij}-y_{ik})\log\left(1+\exp\left(s_{ik}[:m]-s_{ij}[:m]\right)\right), \tag{8}
$$
where $I(y_{ij}>y_{ik})$ is an indicator function equal to 1 if $y_{ij}>y_{ik}$ and 0 otherwise, and $s_{ij}[:m]$ denotes the cosine similarity between the query embedding $emb_{i}[:m]$ and the corpus embedding $emb_{j}[:m]$. The total loss function is:
$$
\mathcal{L}_{total}=\mathcal{L}_{rank}+\alpha\mathcal{L}_{un-sup}, \tag{9}
$$
with $\alpha$ being a hyperparameter fixed at $\alpha=1.0$. As SMEC involves multi-stage training, the training epochs of the other methods are aligned with the total number of epochs consumed by SMEC, and their best performance is reported.
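As a concrete illustration, the rank loss of Eq. (8) can be sketched for a single query $i$ as follows, assuming cosine similarity over truncated embeddings; the function name and the `dims` argument are illustrative, and the full loss additionally sums over queries and, per Eq. (9), adds $\alpha$ times the unsupervised term.

```python
import numpy as np


def rank_loss(query, corpus, labels, dims):
    """Single-query version of the rank loss in Eq. (8).

    query:  (D,) query embedding emb_i
    corpus: (N, D) corpus embeddings emb_j
    labels: (N,) graded relevance scores y_ij
    dims:   truncation lengths m to sum over (e.g. [128, 256, 512])
    """
    loss = 0.0
    for m in dims:
        # cosine similarities s_ij[:m] between the truncated query and corpus
        q = query[:m] / np.linalg.norm(query[:m])
        c = corpus[:, :m] / np.linalg.norm(corpus[:, :m], axis=1, keepdims=True)
        s = c @ q
        n = len(labels)
        for j in range(n):
            for k in range(n):
                if labels[j] > labels[k]:  # indicator I(y_ij > y_ik)
                    loss += (labels[j] - labels[k]) * np.log1p(np.exp(s[k] - s[j]))
    return float(loss)
```

The inner log-term penalizes the model whenever a less relevant document $k$ scores close to or above a more relevant document $j$ at any truncation length $m$.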
### 4.3 Results
In this subsection, the results on the BEIR, Fashion-200K, and Products-10K datasets are given. Retrieval performance is evaluated using the normalized discounted cumulative gain at rank 10 (nDCG@10) Kalervo et al. (2002) metric.
BEIR. As shown in Figure 5, we compare the performance of SMEC and other state-of-the-art methods on two types of models, the API-based OpenAI text embedding and the open-source LLM2Vec, across various compressed dimensions. Notably, SMEC exhibits the strongest performance retention, particularly at the lowest target dimensions. For example, when compressed to 128 dimensions, SMEC improves the performance of the OpenAI and LLM2Vec models by 1.9 and 1.1 points respectively, compared to the best-performing baseline, Matryoshka-Adaptor.
Products-10K. Images naturally contain denser features than text O Pinheiro et al. (2020). As shown in Figure 8(a) of Appendix C, SMEC surpasses other dimensionality reduction methods in image retrieval tasks, highlighting the effectiveness of the ADS module in mitigating information degradation during dimension pruning.
Fashion-200K. Unlike unimodal datasets, Fashion-200K involves cross-modal queries and documents, such as image-to-text and text-to-image retrieval. As illustrated in Figures 8(b) and 8(c) of Appendix C, SMEC achieves superior performance in both directions, demonstrating strong robustness in multimodal scenarios.
<details>
<summary>figures/fig_var.png Details</summary>

### Visual Description
## Line Chart: Gradient Variance vs. Epochs
### Overview
The image presents a line chart illustrating the relationship between Gradient Variance and Epochs for two different methods: MRL and SMRL. The chart displays how the gradient variance changes as the training progresses through epochs. The y-axis is on a logarithmic scale.
### Components/Axes
* **X-axis:** Epochs, ranging from 0 to 40.
* **Y-axis:** Gradient Variance, on a logarithmic scale from approximately 1e-9 to 1e-7.
* **Data Series 1:** MRL (black line)
* **Data Series 2:** SMRL (light blue line)
* **Legend:** Located in the top-right corner, labeling the two data series.
* **Grid:** A light gray grid is present in the background to aid in reading values.
### Detailed Analysis
**MRL (Black Line):**
The MRL line starts at approximately 1.1e-7 at Epoch 0. It initially decreases to around 8e-8 at Epoch 5, then continues to decrease, reaching a minimum of approximately 7e-9 at Epoch 25. After Epoch 25, the line exhibits a slight increase, fluctuating between 7e-9 and 9e-9 until Epoch 40, where it ends at approximately 8.5e-9.
**SMRL (Light Blue Line):**
The SMRL line begins at approximately 5e-7 at Epoch 0. It shows a rapid decrease initially, falling to around 2e-8 by Epoch 5. The line continues to descend, reaching a minimum of approximately 5e-9 at Epoch 30. From Epoch 30 to 40, the line fluctuates, ending at approximately 6e-9.
**Data Points (Approximate):**
| Epoch | MRL (Gradient Variance) | SMRL (Gradient Variance) |
|-------|--------------------------|---------------------------|
| 0 | 1.1e-7 | 5e-7 |
| 5 | 8e-8 | 2e-8 |
| 10 | 5e-8 | 1e-8 |
| 15 | 3e-8 | 7e-9 |
| 20 | 2e-8 | 6e-9 |
| 25 | 7e-9 | 5.5e-9 |
| 30 | 8e-9 | 5e-9 |
| 35 | 8.5e-9 | 6e-9 |
| 40 | 8.5e-9 | 6e-9 |
### Key Observations
* Both MRL and SMRL exhibit a decreasing trend in gradient variance as the number of epochs increases, indicating convergence during training.
* SMRL consistently demonstrates lower gradient variance compared to MRL across all epochs.
* The rate of decrease in gradient variance is steeper for SMRL, especially in the initial epochs.
* MRL shows a slight increase in gradient variance after Epoch 25, while SMRL remains relatively stable.
### Interpretation
The chart suggests that the SMRL method is more effective at reducing gradient variance during training compared to the MRL method. The lower gradient variance achieved by SMRL indicates a more stable and potentially faster convergence process. The initial rapid decrease in gradient variance for SMRL suggests that it quickly finds a good direction for parameter updates. The slight increase in gradient variance for MRL after Epoch 25 could indicate that the method is approaching a local minimum or experiencing oscillations. The logarithmic scale of the y-axis emphasizes the significant reduction in gradient variance achieved by both methods, but particularly by SMRL. This data likely represents the results of a machine learning training process, where the goal is to minimize the gradient variance to achieve optimal model performance.
</details>
(a) Gradient Variance
<details>
<summary>figures/fig_loss.png Details</summary>

### Visual Description
## Line Chart: Loss vs. Epochs for MRL and SMRL
### Overview
This image presents a line chart illustrating the relationship between 'Loss' and 'Epochs' for two different models: MRL and SMRL. The chart displays how the loss function changes as the models are trained over a number of epochs.
### Components/Axes
* **X-axis:** 'Epochs', ranging from 0 to 40, with tick marks at intervals of 5.
* **Y-axis:** 'Loss', ranging from 0.04 to 0.10, with tick marks at intervals of 0.01.
* **Data Series:** Two lines representing the loss values for MRL and SMRL.
* **Legend:** Located in the top-right corner, identifying the lines as 'MRL' (black) and 'SMRL' (cyan).
* **Grid:** A light gray grid is present to aid in reading values.
### Detailed Analysis
**MRL (Black Line):**
The MRL line starts at approximately 0.095 and generally slopes downward.
* Epoch 0: Loss ≈ 0.095
* Epoch 5: Loss ≈ 0.075
* Epoch 10: Loss ≈ 0.065
* Epoch 15: Loss ≈ 0.062
* Epoch 20: Loss ≈ 0.061
* Epoch 25: Loss ≈ 0.060
* Epoch 30: Loss ≈ 0.059
* Epoch 35: Loss ≈ 0.058
* Epoch 40: Loss ≈ 0.057
The line exhibits fluctuations after Epoch 20, indicating a potential plateau or oscillation in the learning process.
**SMRL (Cyan Line):**
The SMRL line begins at approximately 0.098 and demonstrates a steeper initial descent compared to MRL.
* Epoch 0: Loss ≈ 0.098
* Epoch 5: Loss ≈ 0.065
* Epoch 10: Loss ≈ 0.055
* Epoch 15: Loss ≈ 0.051
* Epoch 20: Loss ≈ 0.049
* Epoch 25: Loss ≈ 0.048
* Epoch 30: Loss ≈ 0.047
* Epoch 35: Loss ≈ 0.046
* Epoch 40: Loss ≈ 0.045
The SMRL line appears to converge more rapidly and smoothly than the MRL line, with less fluctuation after Epoch 20.
### Key Observations
* SMRL consistently exhibits lower loss values than MRL across all epochs.
* Both models demonstrate a decreasing loss trend, indicating successful learning.
* The rate of loss reduction slows down for both models as the number of epochs increases, suggesting convergence.
* MRL shows more pronounced oscillations in the loss function after Epoch 20, potentially indicating instability or a need for further tuning.
### Interpretation
The chart suggests that the SMRL model outperforms the MRL model in terms of loss reduction during training. The faster convergence and smoother loss curve of SMRL indicate that it learns more efficiently and potentially generalizes better than MRL. The oscillations observed in MRL's loss curve after Epoch 20 could be due to factors such as a high learning rate, noisy data, or an inadequate model capacity. Further investigation into the training process and model architecture may be necessary to address these issues and improve the performance of MRL. The data demonstrates a clear trend of diminishing returns as training progresses, highlighting the importance of early stopping or other regularization techniques to prevent overfitting. The difference in performance between the two models suggests that the specific techniques employed in SMRL are more effective for this particular task or dataset.
</details>
(b) Validation loss
<details>
<summary>figures/fig_ndcg.png Details</summary>

### Visual Description
## Line Chart: NDCG@10 vs. Epochs for MRL and SMRL
### Overview
This image presents a line chart comparing the performance of two models, MRL and SMRL, over 40 epochs. The performance metric is NDCG@10, plotted on the y-axis, against the number of epochs on the x-axis. The chart shows how the NDCG@10 score changes as the models are trained for more epochs.
### Components/Axes
* **X-axis:** "Epochs", ranging from 0 to 40.
* **Y-axis:** "NDCG@10", ranging from 0.41 to 0.47.
* **Legend:** Located in the top-left corner, identifying two lines:
* "MRL" - represented by a black line.
* "SMRL" - represented by a teal (cyan-blue) line.
* **Grid:** A light gray grid is overlaid on the chart for easier readability.
### Detailed Analysis
**MRL (Black Line):**
The MRL line starts at approximately 0.428 at epoch 0. It generally decreases until around epoch 10, reaching a minimum of approximately 0.412. From epoch 10 to epoch 25, the line exhibits an upward trend, increasing to around 0.458. Between epochs 25 and 40, the line fluctuates, oscillating between approximately 0.458 and 0.465, ending at approximately 0.463 at epoch 40.
**SMRL (Teal Line):**
The SMRL line begins at approximately 0.465 at epoch 0. It initially decreases rapidly to a minimum of approximately 0.425 at epoch 5. From epoch 5 to epoch 20, the line shows an increasing trend, reaching approximately 0.445 at epoch 20. Between epochs 20 and 30, the line increases sharply to a peak of approximately 0.472 at epoch 30. From epoch 30 to 40, the line fluctuates around 0.47, ending at approximately 0.471 at epoch 40.
**Data Points (Approximate):**
| Epoch | MRL | SMRL |
|-------|---------|---------|
| 0 | 0.428 | 0.465 |
| 5 | 0.424 | 0.425 |
| 10 | 0.412 | 0.430 |
| 15 | 0.422 | 0.440 |
| 20 | 0.440 | 0.445 |
| 25 | 0.458 | 0.460 |
| 30 | 0.460 | 0.472 |
| 35 | 0.463 | 0.470 |
| 40 | 0.463 | 0.471 |
### Key Observations
* SMRL consistently outperforms MRL across all epochs, maintaining a higher NDCG@10 score.
* Both models exhibit initial decreases in performance before improving with training.
* SMRL shows a more rapid increase in performance between epochs 20 and 30 compared to MRL.
* Both models appear to converge in performance after epoch 30, with fluctuations around their peak values.
### Interpretation
The chart demonstrates the training progress of two ranking models, MRL and SMRL, using the NDCG@10 metric. The initial decrease in performance for both models could be attributed to the model adjusting to the training data and potentially overfitting to initial batches. The subsequent increase indicates learning and improvement in ranking quality. The superior performance of SMRL suggests that its architecture or training process is more effective at learning the underlying ranking function. The convergence of the lines after epoch 30 suggests that both models are approaching their maximum performance potential, and further training may yield diminishing returns. The fluctuations in the later epochs could be due to the stochastic nature of the training process or the inherent variability in the evaluation data. The data suggests that SMRL is a more effective model for this ranking task, but both models benefit from continued training up to a certain point.
</details>
(c) Retrieval performance
Figure 6: Analysis of metrics during the training process. (a) shows the gradient variance curve (with the vertical axis in a logarithmic scale), (b) presents the loss curve on the validation set, and (c) illustrates the performance variations on the test set. As training progresses, the gradient variances of both MRL and SMRL decrease; however, the gradient variance of MRL remains several times higher than that of SMRL. Consequently, the loss curve of SMRL converges more quickly to a lower value, and the compressed embedding demonstrates better retrieval performance.
## 5 Discussions
### 5.1 The influence of gradient variance
To validate the impact of gradient variance on convergence speed and model performance (as discussed in Section 3.2), we conducted comparative experiments between SMRL and MRL using the MiniLM model on the BEIR dataset. As shown in Figure 6(a), MRL consistently exhibits significantly higher gradient variance than SMRL throughout training. Consequently, the training loss of MRL continues to decline beyond the 20th epoch, whereas SMRL's loss starts to converge at the 15th epoch. A similar trend is observed in subfigure 6(c), where SMRL enters the improvement phase earlier and converges to superior performance.
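The exact estimator behind the gradient-variance curves in Figure 6(a) is not specified here; a common choice, sketched below under that assumption, is to sample gradient vectors over several training steps and average the per-parameter variance.

```python
import numpy as np


def gradient_variance(grad_samples):
    """Estimate gradient variance from sampled per-step gradients.

    grad_samples: (T, P) array of T gradient vectors over P parameters.
    Returns the mean over parameters of the per-parameter variance across steps.
    """
    return float(np.var(grad_samples, axis=0).mean())
```

Lower values of this statistic indicate more consistent update directions across steps, which is the behavior SMRL exhibits relative to MRL in Figure 6(a).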
<details>
<summary>figures/fig_discuss2_rank.png Details</summary>

### Visual Description
## Chart: Gradient Size and Variance vs. Epochs
### Overview
The image presents a chart illustrating the relationship between Epochs and both Gradient Size and Gradient Variance. The chart uses a dual y-axis, with the left axis representing Gradient Size (on a logarithmic scale) and the right axis representing Variance (also on a logarithmic scale). Multiple lines depict the Average Gradient for different configurations, and two lines show the Gradient Variance for different configurations.
### Components/Axes
* **X-axis:** Epochs, ranging from approximately -2 to 32.
* **Y-axis (Left):** Gradient Size, logarithmic scale from 10<sup>-2</sup> to 10<sup>1</sup>. Labeled "Gradient Size".
* **Y-axis (Right):** Variance, logarithmic scale from 10<sup>-10</sup> to 10<sup>-7</sup>. Labeled "var".
* **Legend:** Located in the top-right corner. Contains the following entries with corresponding colors:
* Average Gradient (∇<sub>i,j∈[0,96]</sub>,SMRL) - Light Blue
* Average Gradient (∇<sub>i,j∈[96,192]</sub>,SMRL) - Pale Orange
* Average Gradient (∇<sub>i,j∈[0,96]</sub>,MRL) - Light Brown
* Average Gradient (∇<sub>i,j∈[96,192]</sub>,MRL) - Dark Brown
* Gradient Variance (∇<sub>k,k∈[0,192]</sub>,SMRL) - Blue Circles
* Gradient Variance (∇<sub>k,k∈[0,192]</sub>,MRL) - Red Squares
### Detailed Analysis
The chart displays several lines representing different configurations.
**Average Gradient Lines:**
* **Light Blue Line (∇<sub>i,j∈[0,96]</sub>,SMRL):** Starts at approximately 3.666 x 10<sup>-7</sup> at Epoch -2, decreases rapidly to approximately 1.188 x 10<sup>-7</sup> at Epoch 0, and continues to decrease to approximately 0.038 x 10<sup>-7</sup> (3.8 x 10<sup>-9</sup>) at Epoch 30. The line exhibits a steep downward slope initially, which gradually flattens.
* **Pale Orange Line (∇<sub>i,j∈[96,192]</sub>,SMRL):** Starts at approximately 1.198 x 10<sup>-7</sup> at Epoch -2, decreases to approximately 1.054 x 10<sup>-7</sup> at Epoch 0, and continues to decrease to approximately 0.013 x 10<sup>-7</sup> (1.3 x 10<sup>-9</sup>) at Epoch 30. Similar to the light blue line, it shows a decreasing trend with a flattening slope.
* **Light Brown Line (∇<sub>i,j∈[0,96]</sub>,MRL):** Starts at approximately 8.84 x 10<sup>-8</sup> at Epoch -2, decreases to approximately 1.188 x 10<sup>-7</sup> at Epoch 0, and continues to decrease to approximately 0.025 x 10<sup>-7</sup> (2.5 x 10<sup>-9</sup>) at Epoch 30.
* **Dark Brown Line (∇<sub>i,j∈[96,192]</sub>,MRL):** Starts at approximately 3.666 x 10<sup>-7</sup> at Epoch -2, decreases to approximately 0.081 x 10<sup>-7</sup> (8.1 x 10<sup>-9</sup>) at Epoch 20, and continues to decrease to approximately 0.013 x 10<sup>-7</sup> (1.3 x 10<sup>-9</sup>) at Epoch 30.
**Gradient Variance Lines:**
* **Blue Circles (∇<sub>k,k∈[0,192]</sub>,SMRL):** Starts at approximately 2.27e-8 at Epoch 10, decreases to approximately 1.48e-9 at Epoch 20, and continues to decrease to approximately 0.024 x 10<sup>-9</sup> (2.4 x 10<sup>-11</sup>) at Epoch 30. The line shows a consistent downward trend.
* **Red Squares (∇<sub>k,k∈[0,192]</sub>,MRL):** Starts at approximately 0.184 x 10<sup>-7</sup> (1.84 x 10<sup>-8</sup>) at Epoch 10, decreases to approximately 4.93e-9 at Epoch 20, and continues to decrease to approximately 0.038 x 10<sup>-9</sup> (3.8 x 10<sup>-11</sup>) at Epoch 30. The line also shows a consistent downward trend.
### Key Observations
* All lines exhibit a decreasing trend as Epochs increase, indicating that both Gradient Size and Gradient Variance generally decrease with training.
* The Average Gradient lines (all colors) are significantly higher in magnitude than the Gradient Variance lines (blue and red).
* The lines representing the MRL configuration (light brown and dark brown) generally have lower values than those representing the SMRL configuration (light blue and pale orange) for Average Gradient.
* The Gradient Variance lines for both SMRL and MRL configurations show similar decreasing trends.
* The initial values for the Average Gradient lines are quite high, but they rapidly decrease with increasing Epochs.
### Interpretation
The chart demonstrates the typical behavior of gradient descent during training. As the model trains (Epochs increase), the gradients generally decrease in size, indicating that the model is converging towards a minimum of the loss function. The decreasing variance suggests that the gradients are becoming more consistent, which is also a sign of convergence.
The differences between the SMRL and MRL configurations suggest that the SMRL configuration may require larger gradients initially, but both configurations ultimately converge to similar levels of gradient size and variance. The logarithmic scales used for both axes emphasize the significant reduction in gradient size and variance over the training process. The initial high gradient values could indicate a large initial error or a steep learning rate. The flattening of the curves towards the end of the training process suggests that the model is approaching a stable state. The consistent downward trend in variance indicates that the training process is becoming more stable and less sensitive to individual data points.
</details>
(a) Rank Loss
<details>
<summary>figures/fig_discuss2_mse.png Details</summary>

### Visual Description
## Chart: Gradient Size vs. Epochs with Variance
### Overview
The image presents a chart illustrating the relationship between gradient size and epochs during a training process. It displays four lines representing the average gradient for different parameter sets, along with two lines representing the gradient variance for those same sets. The chart uses a logarithmic scale for the gradient size (y-axis) and a linear scale for epochs (x-axis). A heatmap in the background indicates the variance.
### Components/Axes
* **X-axis:** Epochs (linear scale, ranging from approximately -2 to 35)
* **Y-axis:** Gradient Size (logarithmic scale, ranging from approximately 1e-8 to 1e+0)
* **Legend:**
* Average Gradient (Ļ<sub>i,jā[0.96]</sub>, SMRL) - Light Blue
* Average Gradient (Ļ<sub>i,jā[96,192]</sub>, SMRL) - Orange
* Average Gradient (Ļ<sub>i,jā[0.96]</sub>, MRL) - Yellow
* Average Gradient (Ļ<sub>i,jā[96,192]</sub>, MRL) - Red
* Gradient Variance (Ļ<sub>k,kā[0.192]</sub>, SMRL) - Blue
* Gradient Variance (Ļ<sub>k,kā[0.192]</sub>, MRL) - Brown
* **Heatmap:** Background color representing variance, with a colorbar on the right indicating the variance scale (ranging from approximately 1e-5 to 1e-8).
### Detailed Analysis
The chart displays six lines, each representing a different metric.
**Average Gradient Lines:**
* **Light Blue (Ļ<sub>i,jā[0.96]</sub>, SMRL):** The line starts at approximately 1.124 at epoch -2 and decreases rapidly to approximately 0.025 at epoch 35.
* **Orange (Ļ<sub>i,jā[96,192]</sub>, SMRL):** The line begins at approximately 1.993 at epoch -2 and decreases to approximately 0.023 at epoch 35.
* **Yellow (Ļ<sub>i,jā[0.96]</sub>, MRL):** The line starts at approximately 5.32e-5 at epoch -2 and decreases to approximately 6.51e-9 at epoch 35.
* **Red (Ļ<sub>i,jā[96,192]</sub>, MRL):** The line begins at approximately 1.51e-5 at epoch -2 and decreases to approximately 6.15e-9 at epoch 35.
**Gradient Variance Lines:**
* **Blue (Ļ<sub>k,kā[0.192]</sub>, SMRL):** The line starts at approximately 2.717 at epoch -2 and decreases to approximately 0.062 at epoch 35.
* **Brown (Ļ<sub>k,kā[0.192]</sub>, MRL):** The line begins at approximately 1.037 at epoch -2 and decreases to approximately 0.025 at epoch 35.
**Specific Data Points (Approximate):**
* Epoch 0: Gradient Sizes: ~1.124, ~1.993, ~5.32e-5, ~1.51e-5, ~2.717, ~1.037
* Epoch 10: Gradient Sizes: ~0.088, ~0.18, ~2.43e-7, ~0.078, ~0.077, ~0.039
* Epoch 20: Gradient Sizes: ~0.039, ~0.04, ~2.09e-8, ~0.037, ~4.68e-8, ~0.03
* Epoch 30: Gradient Sizes: ~0.025, ~0.023, ~6.51e-9, ~6.15e-9, ~0.062, ~0.025
### Key Observations
* All lines exhibit a decreasing trend, indicating that both average gradient size and gradient variance decrease as the number of epochs increases.
* The average gradient lines (light blue, orange, yellow, red) are generally higher in magnitude than the gradient variance lines (blue, brown).
* The lines representing SMRL (light blue and orange) start at higher values than those representing MRL (yellow and red).
* The heatmap shows a gradient of variance, with higher variance values (warmer colors) at the beginning of training and lower variance values (cooler colors) as training progresses.
* The variance lines show a similar decreasing trend, but the magnitude of the decrease is less pronounced than that of the average gradient lines.
### Interpretation
The chart demonstrates the typical behavior of gradient descent during training. As the model trains (epochs increase), the gradients generally decrease in size, indicating that the model is converging towards a minimum of the loss function. The decreasing gradient variance suggests that the training process is becoming more stable.
The difference between SMRL and MRL likely represents different training configurations or datasets. The higher initial gradient sizes for SMRL suggest that it may require more epochs to converge or that it is learning at a faster rate initially.
The heatmap provides a visual representation of the variance in the gradients, which can be used to assess the stability of the training process. The decreasing variance over time suggests that the training process is becoming more stable as the model converges.
The data suggests that the training process is progressing as expected, with gradients decreasing and variance stabilizing over time. The differences between SMRL and MRL warrant further investigation to understand the impact of different training configurations on model performance. The logarithmic scale on the y-axis emphasizes the rapid initial decrease in gradient size, followed by a slower decrease as the model approaches convergence.
</details>
(b) MSE Loss
<details>
<summary>figures/fig_discuss2_ce.png Details</summary>

### Visual Description
Gradient statistics under CE loss plotted against training epochs, with gradient size on a logarithmic axis. The chart shows average gradient magnitudes of parameters in $[0,96]$ and $[96,192]$ under SMRL and under MRL, together with the gradient variance over all parameters in $[0,192]$ for each configuration. All curves trend downward as training progresses, indicating convergence, with differing rates across configurations.
</details>
(c) CE Loss
Figure 7: Gradient statistics with rank, MSE, and CE losses (vertical axis in logarithmic scale): average gradient magnitudes of parameters in the ranges $[0,96]$ and $[96,192]$, as well as the gradient variance over all parameters in the range $[0,192]$, during training.
### 5.2 Gradient variance of different loss functions
Section 5.1 demonstrates that MRL exhibits higher gradient variance than SMRL when rank loss is employed as the loss function, corroborating the findings presented in Section 3.2. To strengthen this validation, we conducted additional experiments on the BEIR dataset using rank loss, MSE loss, and cross-entropy (CE) loss under identical settings. The results depicted in Figure 7 reveal a consistent pattern across all three loss functions, validating the robustness of our conclusions.
### 5.3 Ablation studies
To evaluate the contribution of each component of SMEC to overall performance, we conduct ablation studies with MRL as the baseline, incrementally adding the individual modules on top of MRL, as detailed in Table 1. Examined individually, the SMRL strategy yields the most significant performance gain, suggesting that its reduced gradient variance contributes positively to model performance. Both the ADS module and the S-XBM module also provide notable improvements. Combining all three components improves the performance of the 128-dimensional embedding by 3.1 points.
| Model | 128 | 256 | 512 | 768 |
| --- | --- | --- | --- | --- |
| w/ SMRL | 0.3808 | 0.4621 | 0.4895 | 0.5283 |
| w/ ADS | 0.3765 | 0.4583 | 0.4863 | 0.5254 |
| w/ S-XBM | 0.3778 | 0.4583 | 0.4853 | 0.5256 |
| SMEC (Ours) | 0.4053 | 0.4848 | 0.5002 | 0.5459 |
Table 1: Ablation studies of SMEC on 8 BEIR datasets with MRL as the baseline.
### 5.4 The contribution of ADS in preserving key information
The selection of important parameters in neural networks is a well-established research area, and numerous studies have demonstrated that network parameters are often redundant. As a result, parameter pruning has been widely adopted for model compression. Although ADS (and, more generally, the MEC family of methods) focuses on selecting dimensions within embeddings, it is fundamentally implemented through network parameter selection. ADS can therefore be regarded as a form of pruning method with theoretical feasibility.
To fully demonstrate the effectiveness of ADS, we evaluate the dimension selection strategies of both ADS and MRL using Weighted Average Reconstruction Error (WARE; Yu et al., 2018), a metric commonly used in the pruning literature to assess parameter importance. WARE is defined as follows:
$$
\text{WARE}=\frac{1}{M}\sum_{m=1}^{M}\frac{|\hat{y}_{m}-y_{m}|}{|y_{m}|} \tag{10}
$$
where $M$ denotes the number of samples, and $y_{m}$ and $\hat{y}_{m}$ represent the model's score (which can be interpreted as the similarity between the embedding pair) for the $m$-th sample before and after dimension pruning, respectively. The core idea of WARE is to quantify the change in the model's output induced by removing a specific dimension; a larger change indicates higher importance of that dimension.
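The metric is straightforward to compute. The sketch below is illustrative rather than the paper's implementation: it assumes a dot-product score over embedding pairs and measures each dimension's importance as the WARE incurred when that single dimension is zeroed out.

```python
import numpy as np

def ware(scores_before, scores_after):
    """Weighted Average Reconstruction Error (Eq. 10): mean relative
    change in the model's score caused by pruning."""
    y = np.asarray(scores_before, dtype=float)
    y_hat = np.asarray(scores_after, dtype=float)
    return float(np.mean(np.abs(y_hat - y) / np.abs(y)))

def ware_per_dimension(q, p):
    """Importance of each dimension: the WARE obtained when that single
    dimension is zeroed out of the paired embeddings (q, p).
    Assumes a dot-product similarity score (illustrative choice)."""
    base = np.sum(q * p, axis=1)            # score per pair, all dims kept
    return np.array([ware(base, base - q[:, d] * p[:, d])
                     for d in range(q.shape[1])])
```

With this scoring, a dimension that carries most of the similarity mass receives a much larger WARE than a near-irrelevant one.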
We randomly sampled 10,000 instances from multiple BEIR sub-datasets. For the LLM2Vec embeddings (3,072 dimensions), we computed the WARE for each dimension. We then used both ADS and MRL to generate low-dimensional embeddings of 1536, 768, and 256 dimensions. For each method and compression level, we calculated the achievement rate, defined as the proportion of selected dimensions that appear among the top-N most important dimensions according to the WARE-based ranking.
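As a minimal sketch of this evaluation (the function name and the choice of N equal to the number of retained dimensions are assumptions consistent with the description above):

```python
import numpy as np

def achievement_rate(selected_dims, importance):
    """Proportion of selected dimensions that fall among the top-N most
    important dimensions (N = number of selected dimensions), under a
    WARE-style per-dimension importance ranking."""
    selected = list(selected_dims)
    top_n = set(np.argsort(importance)[::-1][:len(selected)].tolist())
    return sum(d in top_n for d in selected) / len(selected)

# Toy importance scores: MRL-style truncation keeps the first k indices,
# while a selection method may keep an arbitrary index set.
importance = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
truncated = achievement_rate([0, 1, 2], importance)   # keeps unimportant dim 1
selected = achievement_rate([0, 2, 4], importance)    # hits the top 3 exactly
```

This illustrates why truncation-style methods score poorly when importance is not concentrated in the leading positions.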
| Dimension | ADS (Dimension Selection) | MRL (Dimension Truncation) |
| --- | --- | --- |
| 1536 | 94.3% | 50.3% |
| 768 | 90.1% | 32.8% |
| 256 | 83.6% | 17.4% |
Table 2: Achievement Rate of Important Dimension Selection at Different Dimension Levels.
The results in Table 2 show that the achievement rate of MRL is roughly linear in the compression ratio, indicating that the importance of dimensions has no strong correlation with their positions. The achievement rate of ADS also decreases as fewer dimensions are retained, reflecting the increased difficulty of selecting the top-N most important dimensions under higher compression ratios. However, even when compressed by a factor of 6, ADS still selects over 80% of the most important dimensions. This explains why, as seen in Figure 5, SMEC demonstrates stronger performance at lower dimensions.
### 5.5 Memory size of S-XBM
In this subsection, we explore how the memory size of the S-XBM module affects training speed and model performance. In theory, a larger memory makes it easier for S-XBM to mine more hard samples, thereby improving model performance. However, an excessively large memory increases the time needed to retrieve the top-k samples, which hurts training efficiency. To verify this trade-off experimentally, we train the SMEC framework with memory sizes of 1000, 2000, 5000, 10000, and 15000, as shown in Table 3. The results demonstrate a clear trade-off between training speed and model performance. We select a memory size of 5000 as our final choice to balance the two.
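A minimal sketch of such a memory, assuming a FIFO buffer and cosine-similarity top-k retrieval (the class name and details are illustrative, not the paper's implementation):

```python
import numpy as np

class SelectableXBM:
    """FIFO memory of recent embeddings; for each query, only the top-k
    most similar stored entries are retrieved as hard samples."""

    def __init__(self, memory_size, dim):
        self.memory_size = memory_size
        self.buffer = np.empty((0, dim))

    def enqueue(self, batch):
        # Append the new batch and keep only the most recent entries.
        self.buffer = np.concatenate([self.buffer, batch])[-self.memory_size:]

    def topk(self, queries, k):
        # Cosine similarity between queries and every stored embedding;
        # this scan is what makes retrieval cost grow with memory size.
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        m = self.buffer / np.linalg.norm(self.buffer, axis=1, keepdims=True)
        idx = np.argsort(q @ m.T, axis=1)[:, ::-1][:, :k]
        return self.buffer[idx]               # shape: (n_queries, k, dim)
```

Because `topk` scans the whole buffer, its cost grows linearly with the memory size, which is exactly the training-speed side of the trade-off measured in Table 3.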
| Memory Size | 1000 | 2000 | 5000 | 10000 | 15000 |
| --- | --- | --- | --- | --- | --- |
| Forward Time/s $\downarrow$ | 0.06 | 0.08 | 0.11 | 0.15 | 0.21 |
| NDCG@10 $\uparrow$ | 0.4631 | 0.4652 | 0.4675 | 0.4682 | 0.4689 |
Table 3: Trade-off analysis of training speed and model performance under different memory size of S-XBM.
## 6 Conclusions
Although high-dimensional embeddings from large language models (LLMs) capture rich semantic features, their practical use is often limited by computational and storage constraints. To mitigate these limitations, this paper proposes the Sequential Matryoshka Embedding Compression (SMEC) framework for efficient embedding compression. SMEC comprises the Sequential Matryoshka Representation Learning (SMRL) module, which mitigates gradient variance during training; the Adaptive Dimension Selection (ADS) module, which minimizes information degradation during feature compression; and the Selectable Cross-batch Memory (S-XBM) module, which enhances unsupervised learning between high- and low-dimensional embeddings. Compared to existing approaches, our approach preserves higher performance at the same compression rate.
## Limitations
The SMEC framework introduces only a small number of additional parameters on top of a pre-trained model and is trained using labeled data from a specific domain, along with mined hard samples, with the aim of reducing the dimensionality of the original embeddings. However, this design and objective limit its generalizability and applicability to broader scenarios. Future work could explore extending the SMEC approach to full-parameter training of representation models, enabling them to directly generate embeddings of multiple dimensions. Additionally, the feasibility of training the model on diverse datasets is also worth investigating.
## References
- (1) OpenAI text embedding. https://platform.openai.com/docs/guides/embeddings/embeddings.
- Bai et al. (2020) Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, and Wei Zhang. 2020. Products-10k: A large-scale product recognition dataset.
- BehnamGhader et al. (2024) Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961.
- Beyer et al. (1999) K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. 1999. When is "nearest neighbor" meaningful? Springer.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Cai et al. (2024) Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. 2024. Matryoshka multimodal models. In Workshop on Video-Language Models @ NeurIPS 2024.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and 1 others. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Gu et al. (2023) Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. 2023. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations.
- Han et al. (2017) Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, and Larry S. Davis. 2017. Automatic spatially-aware fashion concept discovery. IEEE.
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738.
- Hu et al. (2024) Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. 2024. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems, 37:50168–50188.
- Huang et al. (2024) Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, and Lili Qiu. 2024. Llm2clip: Powerful language model unlock richer visual representation. Preprint, arXiv:2411.04997.
- Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with gpus.
- Jolliffe and Cadima (2016) Ian T. Jolliffe and Jorge Cadima. 2016. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng, 374(2065):20150202.
- Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes.
- Kusupati et al. (2022) Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and 1 others. 2022. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249.
- Lee and Seung (2000) Daniel Lee and H Sebastian Seung. 2000. Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 13.
- Li et al. (2023) Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. Losparse: Structured compression of large language models based on low-rank and sparse approximation. In International Conference on Machine Learning, pages 20336–20350. PMLR.
- (24) Geoffrey J. McLachlan. Discriminant analysis and statistical pattern recognition. Wiley-Interscience.
- O Pinheiro et al. (2020) Pedro O O Pinheiro, Amjad Almahairi, Ryan Benmalek, Florian Golemo, and Aaron C Courville. 2020. Unsupervised learning of dense visual representations. Advances in neural information processing systems, 33:4489–4500.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Robertson and Walker (1994) Stephen E. Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. ACM.
- Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding.
- Wang et al. (2020) Xun Wang, Haozhi Zhang, Weilin Huang, and Matthew R Scott. 2020. Cross-batch memory for embedding learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6388–6397.
- Yoon et al. (2023) Jinsung Yoon, Sercan O Arik, Yanfei Chen, and Tomas Pfister. 2023. Search-adaptor: Embedding customization for information retrieval. arXiv preprint arXiv:2310.08750.
- Yoon et al. (2024) Jinsung Yoon, Rajarishi Sinha, Sercan O Arik, and Tomas Pfister. 2024. Matryoshka-adaptor: Unsupervised and supervised tuning for smaller embedding dimensions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10318–10336, Miami, Florida, USA. Association for Computational Linguistics.
- Yu et al. (2018) Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. 2018. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9194–9203.
- Malkov and Yashunin (2018) Yury A. Malkov and Dmitry A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
## Appendix A Derivation of the Gradient Fluctuation
To formalize this issue, we analyze the Mean Squared Error (MSE) loss as a representative case. Let $\mathbf{x}_{1}=[x_{1},x_{2},\ldots,x_{n}]^{\top}\in\mathbb{R}^{n}$ and $\mathbf{x}_{2}=[y_{1},y_{2},\ldots,y_{n}]^{\top}\in\mathbb{R}^{n}$ denote two input feature vectors. The final FC layer employs a matrix $\mathbf{W}=[\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{m}]^{\top}\in\mathbb{R}^{m\times n}$ to generate outputs $\mathbf{y}_{1}=\mathbf{W}\mathbf{x}_{1}\in\mathbb{R}^{m}$ and $\mathbf{y}_{2}=\mathbf{W}\mathbf{x}_{2}\in\mathbb{R}^{m}$. The MSE loss at dimension $d$ is defined as:
$$
\mathcal{L}^{d}=\left[\mathcal{Y}_{label}-sim(\mathbf{y}^{d}_{1},\mathbf{y}^{d}_{2})\right]^{2}, \tag{11}
$$
where $\mathcal{Y}_{label}$ denotes the binary classification label for pairs (0 or 1), and $sim(\cdot)$ represents the normalized similarity of the learned representations.
According to the chain rule, the partial derivative of $\mathcal{L}^{d}$ with respect to the $i$ -th dimension parameter of the FC layer is derived as:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}=\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}\cdot\frac{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}{\partial\mathbf{w}_{i}}+\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}\cdot\frac{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}{\partial\mathbf{w}_{i}}. \tag{12}
$$
Using cosine similarity (clamped to $[0,1]$) as the similarity function $sim(\cdot)$, Equation 11 can be rewritten as:
$$
\mathcal{L}^{d}=\left[\mathcal{Y}_{label}-\frac{{\mathbf{y}^{d}_{1}}^{\top}{\mathbf{y}^{d}_{2}}}{\|\mathbf{y}^{d}_{1}\|\|\mathbf{y}^{d}_{2}\|}\right]^{2}. \tag{13}
$$
Let $\|\mathbf{y}^{d}_{1}\|=A$ , $\|\mathbf{y}^{d}_{2}\|=B$ , ${\mathbf{y}^{d}_{1}}^{\top}{\mathbf{y}^{d}_{2}}=C$ and $s=\frac{C}{AB}$ . The partial derivatives of the $\mathcal{L}^{d}$ with respect to $\left[\mathbf{y}^{d}_{1}\right]_{i}$ and $\left[\mathbf{y}^{d}_{2}\right]_{i}$ are given as follows:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}=2\left(s-\mathcal{Y}_{label}\right)\left(\frac{\left[\mathbf{y}^{d}_{2}\right]_{i}}{AB}-\frac{s}{A^{2}}\left[\mathbf{y}^{d}_{1}\right]_{i}\right), \tag{14}
$$
$$
\frac{\partial\mathcal{L}^{d}}{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}=2\left(s-\mathcal{Y}_{label}\right)\left(\frac{\left[\mathbf{y}^{d}_{1}\right]_{i}}{AB}-\frac{s}{B^{2}}\left[\mathbf{y}^{d}_{2}\right]_{i}\right). \tag{15}
$$
Substituting $\left[\mathbf{y}^{d}_{1}\right]_{i}=\mathbf{w}_{i}\mathbf{x}_{1}$ and $\left[\mathbf{y}^{d}_{2}\right]_{i}=\mathbf{w}_{i}\mathbf{x}_{2}$ , the partial derivatives of the $\left[\mathbf{y}^{d}_{1}\right]_{i}$ and $\left[\mathbf{y}^{d}_{2}\right]_{i}$ with respect to $\mathbf{w}_{i}$ are given as follows:
$$
\frac{\partial\left[\mathbf{y}^{d}_{1}\right]_{i}}{\partial\mathbf{w}_{i}}=\mathbf{x}_{1},\frac{\partial\left[\mathbf{y}^{d}_{2}\right]_{i}}{\partial\mathbf{w}_{i}}=\mathbf{x}_{2}. \tag{16}
$$
Based on the above equations, the partial derivative of $\mathcal{L}^{d}$ with respect to $\mathbf{w}_{i}$ is derived as:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}=2\left(s-\mathcal{Y}_{label}\right)\left[\left(\frac{\left[\mathbf{y}^{d}_{2}\right]_{i}}{AB}-\frac{s}{A^{2}}\left[\mathbf{y}^{d}_{1}\right]_{i}\right)\mathbf{x}_{1}+\left(\frac{\left[\mathbf{y}^{d}_{1}\right]_{i}}{AB}-\frac{s}{B^{2}}\left[\mathbf{y}^{d}_{2}\right]_{i}\right)\mathbf{x}_{2}\right]. \tag{17}
$$
Assume that $A$ and $B$ can be approximated by $\delta(d)\cdot a$ and $\delta(d)\cdot b$, respectively, where $\delta(d)$ captures how the magnitudes of the vectors $\mathbf{y}^{d}_{1}$ and $\mathbf{y}^{d}_{2}$ grow with $d$ (evidently a positive correlation). Under this approximation, Equation 17 can be approximated by the following expression:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}=2\left(s-\mathcal{Y}_{label}\right)\frac{1}{\delta(d)^{2}}\Bigg[\left(\frac{\left[\mathbf{y}^{d}_{2}\right]_{i}}{ab}-\frac{s}{a^{2}}\left[\mathbf{y}^{d}_{1}\right]_{i}\right)\mathbf{x}_{1}+\left(\frac{\left[\mathbf{y}^{d}_{1}\right]_{i}}{ab}-\frac{s}{b^{2}}\left[\mathbf{y}^{d}_{2}\right]_{i}\right)\mathbf{x}_{2}\Bigg]. \tag{18}
$$
In equation 18, $a$ , $b$ , $\left[\mathbf{y}^{d}_{1}\right]_{i}$ , $\left[\mathbf{y}^{d}_{2}\right]_{i}$ are constants, $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ are constant vectors, while $s$ and $\mathcal{Y}_{label}$ are invariant with respect to the index $d$ . Therefore, we can conclude the following:
$$
\frac{\partial\mathcal{L}^{d}}{\partial\mathbf{w}_{i}}\propto\frac{1}{\delta(d)^{2}}. \tag{19}
$$
In theory, this rule can also be extended to other pair-wise similarity-based functions, such as rank loss, which is experimentally verified in Section 5.2.
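As a rough numerical check of this scaling (a sketch under the added assumption of i.i.d. Gaussian weights and inputs, so that $\delta(d)\propto\sqrt{d}$), the analytic gradient of Equation 17 can be evaluated at increasing output dimensions; its average magnitude should shrink as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_wi(W, x1, x2, label, i):
    # Analytic gradient of the cosine-based MSE loss w.r.t. w_i (Eq. 17).
    y1, y2 = W @ x1, W @ x2
    A, B = np.linalg.norm(y1), np.linalg.norm(y2)
    s = (y1 @ y2) / (A * B)
    return 2 * (s - label) * (
        (y2[i] / (A * B) - s * y1[i] / A**2) * x1
        + (y1[i] / (A * B) - s * y2[i] / B**2) * x2
    )

n = 32
mean_norm = {}
for d in (16, 64, 256):                     # retained output dimension d
    norms = [
        np.linalg.norm(grad_wi(rng.normal(size=(d, n)),
                               rng.normal(size=n), rng.normal(size=n),
                               1.0, 0))
        for _ in range(50)                   # average over random draws
    ]
    mean_norm[d] = float(np.mean(norms))

# Gradient magnitude shrinks as d grows, consistent with Eq. 19.
assert mean_norm[16] > mean_norm[64] > mean_norm[256]
```

This is only a sanity check of the trend, not a measurement of the exact $1/\delta(d)^{2}$ constant.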
## Appendix B Results on BEIR Sub-datasets.
We compare the performance of different compression methods on several representative sub-datasets of BEIR, and the results are shown in Table 4.
| Sub-dataset: SciFact (NDCG@10) | 128 | 256 | 512 | 768 | 1536 | 3072 |
| --- | --- | --- | --- | --- | --- | --- |
| LLM2Vec | - | - | - | - | - | 0.787 |
| w/ Search-Adaptor | 0.806 | 0.845 | 0.864 | 0.879 | 0.886 | 0.884 |
| w/ MRL-Adaptor | 0.826 | 0.861 | 0.876 | 0.880 | 0.886 | 0.887 |
| w/ SMEC (ours) | 0.841 | 0.874 | 0.879 | 0.882 | 0.885 | 0.886 |
| Sub-dataset: FiQA | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.498 |
| w/ Search-Adaptor | 0.475 | 0.505 | 0.529 | 0.540 | 0.545 | 0.550 |
| w/ MRL-Adaptor | 0.496 | 0.523 | 0.534 | 0.543 | 0.547 | 0.550 |
| w/ SMEC (ours) | 0.521 | 0.533 | 0.540 | 0.546 | 0.549 | 0.551 |
| Sub-dataset: Quora | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.775 |
| w/ Search-Adaptor | 0.771 | 0.805 | 0.830 | 0.845 | 0.861 | 0.864 |
| w/ MRL-Adaptor | 0.784 | 0.812 | 0.834 | 0.847 | 0.862 | 0.863 |
| w/ SMEC (ours) | 0.794 | 0.818 | 0.839 | 0.850 | 0.862 | 0.865 |
| Sub-dataset: NFCorpus | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.389 |
| w/ Search-Adaptor | 0.345 | 0.375 | 0.396 | 0.412 | 0.425 | 0.426 |
| w/ MRL-Adaptor | 0.364 | 0.384 | 0.403 | 0.419 | 0.426 | 0.427 |
| w/ SMEC (ours) | 0.389 | 0.402 | 0.418 | 0.426 | 0.430 | 0.431 |
| Sub-dataset: SciDocs | | | | | | |
| LLM2Vec | - | - | - | - | - | 0.232 |
| w/ Search-Adaptor | 0.204 | 0.225 | 0.245 | 0.250 | 0.258 | 0.263 |
| w/ MRL-Adaptor | 0.220 | 0.240 | 0.250 | 0.255 | 0.262 | 0.265 |
| w/ SMEC (ours) | 0.239 | 0.246 | 0.251 | 0.255 | 0.261 | 0.264 |
Table 4: Comparison of retrieval performance on 5 BEIR sub-datasets.
## Appendix C Experimental results on Products-10K and Fashion-200k.
<details>
<summary>figures/fig_img.png Details</summary>

### Visual Description
## Line Chart: NDCG@10 (%) vs. Dimensions
### Overview
This line chart compares four methods (PCA, Search-Adaptor, MRL-Adaptor, and SMEC) on NDCG@10 (%) across embedding dimensions from 128 to 1024.
### Components/Axes
* **X-axis:** Dimensions, with markers at 128, 256, 512, 768, and 1024.
* **Y-axis:** NDCG@10 (%), ranging from approximately 40 to 53.
* **Legend:** Top-right corner: PCA (blue circle), Search-Adaptor (orange cross), MRL-Adaptor (green triangle), SMEC (yellow square).
### Approximate Data Points
| Dimensions | PCA | Search-Adaptor | MRL-Adaptor | SMEC |
|---|---|---|---|---|
| 128 | 40 | 49 | 49.5 | 51 |
| 256 | 45 | 50 | 51 | 52 |
| 512 | 47 | 52 | 52.5 | 53 |
| 768 | 48 | 53 | 53 | 53 |
| 1024 | 49 | 53 | 53 | 53 |
### Key Observations
* SMEC achieves the highest NDCG@10 at every dimension, with the largest margins at 128 and 256 dimensions.
* Search-Adaptor and MRL-Adaptor outperform PCA across all dimensions but trail SMEC at low dimensions.
* All methods plateau after 512 dimensions, with diminishing returns from further increases in dimensionality.
* PCA improves most steeply at first but remains the weakest method overall, suggesting it preserves less task-relevant information during compression than the learned adaptors.
</details>
(a) Image retrieval
<details>
<summary>figures/fig_t2i.png Details</summary>

### Visual Description
## Line Chart: NDCG@10 (%) vs. Dimensions
### Overview
This line chart compares four methods (PCA, Search-Adaptor, MRL-Adaptor, and SMEC) on NDCG@10 (%) for text-to-image retrieval, across dimensions from 128 to 768.
### Components/Axes
* **X-axis:** Dimensions, with markers at 128, 256, 384, 512, and 768.
* **Y-axis:** NDCG@10 (%), ranging from 14 to 24.
* **Legend:** Bottom-right corner: PCA (blue circle), Search-Adaptor (orange line), MRL-Adaptor (green triangle), SMEC (yellow square).
### Approximate Data Points
| Dimensions | PCA | Search-Adaptor | MRL-Adaptor | SMEC |
|---|---|---|---|---|
| 128 | 14.2 | 18.0 | 22.2 | 22.5 |
| 256 | 16.2 | 20.2 | 23.0 | 23.2 |
| 384 | 17.8 | 22.0 | 23.8 | 23.8 |
| 512 | 18.5 | 22.5 | 24.0 | 24.0 |
| 768 | 19.0 | 23.0 | 24.0 | 24.0 |
### Key Observations
* MRL-Adaptor and SMEC achieve the highest scores at every dimension, with SMEC slightly ahead at 128 and 256 dimensions.
* PCA starts lowest and improves the most with added dimensions, yet never closes the gap to the learned methods.
* Search-Adaptor rises steeply at first but plateaus below MRL-Adaptor and SMEC at higher dimensions.
</details>
(b) Text-to-Image retrieval
<details>
<summary>figures/fig_i2t.png Details</summary>

### Visual Description
## Line Chart: NDCG@10 (%) vs. Dimensions
### Overview
This line chart compares four methods (PCA, Search-Adaptor, MRL-Adaptor, and SMEC) on NDCG@10 (%) for image-to-text retrieval, across dimensions from 128 to 768.
### Components/Axes
* **X-axis:** Dimensions, with markers at 128, 256, 384, 512, and 768.
* **Y-axis:** NDCG@10 (%), ranging from 14 to 24.
* **Legend:** Bottom-right corner: PCA (blue circle), Search-Adaptor (orange X), MRL-Adaptor (green triangle), SMEC (yellow square).
### Approximate Data Points
| Dimensions | PCA | Search-Adaptor | MRL-Adaptor | SMEC |
|---|---|---|---|---|
| 128 | 13.2 | 18.1 | 22.2 | 22.4 |
| 256 | 15.1 | 19.3 | 22.3 | 22.6 |
| 384 | 16.3 | 21.1 | 22.8 | 23.2 |
| 512 | 17.5 | 22.4 | 23.2 | 23.8 |
| 768 | 18.5 | 22.8 | 23.4 | 24.0 |
### Key Observations
* SMEC achieves the highest NDCG@10 at every dimension, followed closely by MRL-Adaptor.
* PCA scores lowest throughout and shows the steepest improvement with added dimensions.
* Search-Adaptor improves rapidly at first but plateaus below the Matryoshka-style methods.
* Gains diminish beyond 512 dimensions for all methods.
</details>
(c) Image-to-Text retrieval
Figure 8: Experimental results on image and multimodal datasets. (a) presents the results on the Products-10K dataset using an image representation model based on ViT-H (with 1024 dimensions). (b) and (c) show the results on the Fashion-200K dataset for text-to-image and image-to-text retrieval tasks, respectively, using the LLM2CLIP model (with 768 dimensions, based on ViT-L/14 and Llama-3.2-1B).
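All panels in Figure 8 report NDCG@10, the normalized discounted cumulative gain truncated at rank 10. A minimal NumPy sketch of the metric (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: `relevances` lists the graded relevance of the
    retrieved results in rank order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # Discounted cumulative gain: gains decay logarithmically with rank.
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    # Ideal DCG: the same gains under the best possible ordering.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Placing the single relevant item at rank 1 yields a perfect score;
# demoting it to rank 3 halves the score (1 / log2(4) = 0.5).
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5
```

Dataset-level scores average this per-query value over all test queries.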