# Federated Model Heterogeneous Matryoshka Representation Learning
Abstract
Model heterogeneous federated learning (MHeteroFL) enables FL clients to collaboratively train models with heterogeneous structures in a distributed fashion. However, existing MHeteroFL methods rely on training loss to transfer knowledge between the client model and the server model, resulting in limited knowledge exchange. To address this limitation, we propose the Federated model heterogeneous Matryoshka Representation Learning (FedMRL) approach for supervised learning tasks. It adds an auxiliary small homogeneous model shared by clients with heterogeneous local models. (1) The generalized and personalized representations extracted by the two models’ feature extractors are fused by a personalized lightweight representation projector. This step enables representation fusion to adapt to local data distribution. (2) The fused representation is then used to construct Matryoshka representations with multi-dimensional and multi-granular embedded representations learned by the global homogeneous model header and the local heterogeneous model header. This step facilitates multi-perspective representation learning and improves model learning capability. Theoretical analysis shows that FedMRL achieves a $\mathcal{O}(1/T)$ non-convex convergence rate. Extensive experiments on benchmark datasets demonstrate its superior model accuracy with low communication and computational costs compared to seven state-of-the-art baselines. It achieves up to $8.48\%$ and $24.94\%$ accuracy improvement compared with the state-of-the-art baseline and the best same-category baseline, respectively.
1 Introduction
Traditional federated learning (FL) [29] often relies on a central FL server to coordinate multiple data owners (a.k.a., FL clients) to train a global shared model without exposing local data. In each communication round, the server broadcasts the global model to the clients. A client trains it on its local data and sends the updated local model to the FL server. The server aggregates local models to produce a new global model. These steps are repeated until the global model converges.
However, the above design cannot handle the following heterogeneity challenges [49] commonly found in practical FL applications: (1) Data heterogeneity [40, 45, 44, 47, 39, 55]: FL clients’ local data are often not independent and identically distributed (non-IID). A single global model produced by aggregating local models trained on non-IID data might not perform well on all clients. (2) System heterogeneity [11, 46, 48]: FL clients can have diverse system configurations in terms of computing power and network bandwidth. Training the same model structure among such clients means that the global model size must accommodate the weakest device, leading to sub-optimal performance on other more powerful clients. (3) Model heterogeneity [41]: When FL clients are enterprises, they might have heterogeneous proprietary models which cannot be directly shared with others during FL training due to intellectual property (IP) protection concerns.
To address these challenges, the field of model heterogeneous federated learning (MHeteroFL) [52, 49, 53, 54, 51, 50] has emerged. It enables FL clients to train local models with tailored structures suitable for local system resources and local data distributions. Existing MHeteroFL methods [38, 43] are limited in terms of knowledge transfer capabilities as they commonly leverage the training loss between server and client models for this purpose. This design leads to model performance bottlenecks, incurs high communication and computation costs, and risks exposing private local model structures and data.
[Figure 1, left panel: an input $x$ passes through a feature extractor; nested (Matryoshka) representations of increasing dimension are each mapped by a separate header to outputs $\hat{y}_{1},\hat{y}_{2},\hat{y}_{3}$, whose losses $\ell_{1},\ell_{2},\ell_{3}$ are summed into the total loss $\ell$.]
[Figure 1, right panel: a feature extractor (Conv1, Conv2, FC1, FC2) maps the input $x$ to an intermediate representation “Rep”, which a prediction header (FC3) maps to the output $\hat{y}$.]
Figure 1: Left: Matryoshka Representation Learning. Right: Feature extractor and prediction header.
Recently, Matryoshka Representation Learning (MRL) [21] has emerged to tailor representation dimensions based on the computational and storage costs required by downstream tasks to achieve a near-optimal trade-off between model performance and inference costs. As shown in Figure 1 (left), the representation extracted by the feature extractor is constructed to form Matryoshka Representations involving a series of embedded representations ranging from low-to-high dimensions and coarse-to-fine granularities. Each of them is processed by a single output layer for calculating loss, and the sum of losses from all branches is used to update model parameters. This design is inspired by the insight that people often first perceive the coarse aspect of a target before observing the details, with multi-perspective observations enhancing understanding.
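The nested-loss design of MRL can be sketched as follows. This is an illustrative NumPy mock-up, not the original MRL implementation: the dimensions in `dims`, the random linear heads, and the single-sample setting are all arbitrary choices for demonstration.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable cross-entropy for a single sample.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
d, num_classes = 64, 10
dims = [16, 32, 64]                 # nested (Matryoshka) dimensions, coarse to fine

rep = rng.normal(size=d)            # output of the feature extractor
# One linear output layer per granularity (random weights, illustrative only).
heads = [rng.normal(size=(num_classes, m)) * 0.1 for m in dims]
label = 3

# Each head sees only the first m coordinates of the shared representation;
# the losses of all branches are summed to update the model parameters.
losses = [cross_entropy(W @ rep[:m], label) for W, m in zip(heads, dims)]
total_loss = sum(losses)
```

In this construction the low-dimensional heads force the leading coordinates of `rep` to carry coarse, self-sufficient information, while later coordinates refine it.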
Inspired by MRL, we address the aforementioned limitations of MHeteroFL by proposing the Federated model heterogeneous Matryoshka Representation Learning (FedMRL) approach for supervised learning tasks. For each client, a shared global auxiliary homogeneous small model is added to interact with its heterogeneous local model. Both models consist of a feature extractor and a prediction header, as depicted in Figure 1 (right). FedMRL has two key design innovations. (1) Adaptive Representation Fusion: for each local data sample, the feature extractors of the two local models extract generalized and personalized representations, respectively. The two representations are spliced and then mapped to a fused representation by a lightweight personalized representation projector adapting to local non-IID data. (2) Multi-Granularity Representation Learning: the fused representation is used to construct Matryoshka Representations involving multi-dimensional and multi-granular embedded representations, which are processed by the prediction headers of the two models, respectively. The sum of their losses is used to update all models, which enhances the model learning capability owing to multi-perspective representation learning.
The personalized multi-granularity MRL enhances representation knowledge interaction between the homogeneous global model and the heterogeneous client local model. Each client’s local model and data are not exposed during training, preserving privacy. The server and clients only transmit the small homogeneous models, thereby incurring low communication costs. Besides its local model, each client only trains the small homogeneous model and a lightweight representation projector, incurring low extra computational costs. We theoretically derive the $\mathcal{O}(1/T)$ non-convex convergence rate of FedMRL and verify that it can converge over time. Experiments on benchmark datasets comparing FedMRL against seven state-of-the-art baselines demonstrate its superiority. It improves model accuracy by up to $8.48\%$ and $24.94\%$ over the best baseline and the best same-category baseline, while incurring lower communication and computation costs.
2 Related Work
Existing MHeteroFL works can be divided into the following four categories.
MHeteroFL with Adaptive Subnets. These methods [3, 4, 5, 11, 14, 56, 64] construct heterogeneous local subnets of the global model by parameter pruning or special designs to match each client’s local system resources. The server aggregates the parameters of the heterogeneous local subnets to generate a new global model. In cases where clients hold black-box local models with heterogeneous structures not derived from a common global model, the server is unable to aggregate them.
MHeteroFL with Knowledge Distillation. These methods [6, 8, 9, 15, 16, 17, 22, 23, 25, 27, 30, 32, 35, 36, 42, 57, 59] often perform knowledge distillation on heterogeneous client models by leveraging a public dataset with the same data distribution as the learning task. In practice, such a suitable public dataset can be hard to find. Others [12, 60, 61, 63] train a generator to synthesize a shared dataset to deal with this issue. However, this incurs high training costs. The rest (FD [19], FedProto [41] and others [1, 2, 13, 49, 58]) share the intermediate information of client local data for knowledge fusion.
MHeteroFL with Model Split. These methods split models into feature extractors and predictors. Some [7, 10, 31, 33] share homogeneous feature extractors across clients and personalize predictors, while others (LG-FedAvg [24] and [18, 26]) do the opposite. Such methods expose part of the local model structures, which might not be acceptable if the models are proprietary IPs of the clients.
MHeteroFL with Mutual Learning. These methods (FedAPEN [34], FML [38], FedKD [43] and others [28]) add a shared global homogeneous small model on top of each client’s heterogeneous local model. For each local data sample, the distance of the outputs from these two models is used as the mutual loss to update model parameters. Nevertheless, the mutual loss only transfers limited knowledge between the two models, resulting in model performance bottlenecks.
The proposed FedMRL approach further optimizes mutual learning-based MHeteroFL by enhancing the knowledge transfer between the server and client models. It achieves personalized adaptive representation fusion and multi-perspective representation learning, thereby facilitating more knowledge interaction across the two models and improving model performance.
3 The Proposed FedMRL Approach
FedMRL aims to tackle data, system, and model heterogeneity in supervised learning tasks, where a central FL server coordinates $N$ FL clients to train heterogeneous local models. The server maintains a global homogeneous small model $\mathcal{G}(\theta)$ shared by all clients. Figure 2 depicts its workflow (Algorithm 1 in Appendix A describes the FedMRL algorithm in detail):
1. In each communication round, $K$ clients participate in FL (i.e., the client participation rate is $C=K/N$). The global homogeneous small model $\mathcal{G}(\theta)$ is broadcast to them.
1. Each client $k$ holds a heterogeneous local model $\mathcal{F}_{k}(\omega_{k})$ ($\mathcal{F}_{k}(·)$ is the heterogeneous model structure, and $\omega_{k}$ are personalized model parameters). Client $k$ simultaneously trains the heterogeneous local model and the global homogeneous small model on local non-IID data $D_{k}$ ($D_{k}$ follows the non-IID distribution $P_{k}$) via personalized Matryoshka Representation Learning with a personalized representation projector $\mathcal{P}_{k}(\varphi_{k})$.
1. The updated homogeneous small models are uploaded to the server for aggregation to produce a new global model for knowledge fusion across heterogeneous clients.
The objective of FedMRL is to minimize the sum of the losses of the combined models ($\mathcal{W}_{k}(w_{k})=(\mathcal{G}(\theta)\circ\mathcal{F}_{k}(\omega_{k})\mid\mathcal{P}_{k}(\varphi_{k}))$) across all clients, i.e.,
$$
\min_{\theta,\omega_{0},\ldots,\omega_{N-1}}\sum_{k=0}^{N-1}\ell\left(\mathcal{W}_{k}\left(D_{k};\left(\theta\circ\omega_{k}\mid\varphi_{k}\right)\right)\right). \tag{1}
$$
These steps repeat until each client’s model converges. After FL training, a client uses its local combined model without the global header for inference. Appendix C.3 provides experimental evidence for inference model selection.
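The three workflow steps above can be sketched as follows. This is a toy mock-up: the dict-of-arrays "models", the hypothetical `local_update` placeholder (standing in for the Matryoshka-representation training of Sections 3.1 and 3.2), and the equal-weight FedAvg-style averaging are all assumptions for illustration; the paper's exact procedure is Algorithm 1 in Appendix A.

```python
import numpy as np

def aggregate(client_thetas, weights):
    # Step 3: weighted average of the uploaded homogeneous small models,
    # parameter by parameter, to produce the new global model.
    total = sum(weights)
    return {name: sum(w * theta[name] for w, theta in zip(weights, client_thetas)) / total
            for name in client_thetas[0]}

def local_update(theta_global, k):
    # Placeholder for step 2 (client k trains the received small model
    # jointly with its heterogeneous local model); here a dummy shift.
    return {name: p + 0.01 * k for name, p in theta_global.items()}

rng = np.random.default_rng(0)
# Toy global homogeneous small model: a feature extractor and a header.
theta = {"ex": rng.normal(size=(4, 4)), "hd": rng.normal(size=(4, 2))}

K = 3  # participating clients per round
for round_ in range(2):
    updates = [local_update(theta, k) for k in range(K)]  # steps 1-2: broadcast, train
    theta = aggregate(updates, weights=[1.0] * K)         # step 3: aggregate
```

Only the small homogeneous model crosses the network; the heterogeneous local models and the projectors never leave the clients.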
[Figure 2: on each client, the homogeneous extractor $\mathcal{G}^{ex}(\theta^{ex})$ and the heterogeneous extractor $\mathcal{F}_{k}^{ex}(\omega_{k}^{ex})$ each encode the input $x_{i}$; their representations are spliced and projected into Matryoshka representations, which the two prediction headers map to outputs with two losses against the label $y_{i}$; the server aggregates the uploaded local homogeneous small models into the global model $\mathcal{G}(\theta)$ and broadcasts it.]
Figure 2: The workflow of FedMRL.
3.1 Adaptive Representation Fusion
We denote client $k$ ’s heterogeneous local model feature extractor as $\mathcal{F}_{k}^{ex}(\omega_{k}^{ex})$ , and prediction header as $\mathcal{F}_{k}^{hd}(\omega_{k}^{hd})$ . We denote the homogeneous global model feature extractor as $\mathcal{G}^{ex}(\theta^{ex})$ and prediction header as $\mathcal{G}^{hd}(\theta^{hd})$ . Client $k$ ’s local personalized representation projector is denoted as $\mathcal{P}_{k}(\varphi_{k})$ . In the $t$ -th communication round, client $k$ inputs its local data sample $(\boldsymbol{x}_{i},y_{i})∈ D_{k}$ into the two feature extractors to extract generalized and personalized representations as:
$$
\boldsymbol{\mathcal{R}}_{i}^{\mathcal{G}}=\mathcal{G}^{ex}(\boldsymbol{x}_{i};\theta^{ex,t-1}),\quad\boldsymbol{\mathcal{R}}_{i}^{\mathcal{F}_{k}}=\mathcal{F}_{k}^{ex}(\boldsymbol{x}_{i};\omega_{k}^{ex,t-1}). \tag{2}
$$
The two extracted representations $\boldsymbol{\mathcal{R}}_{i}^{\mathcal{G}}∈\mathbb{R}^{d_{1}}$ and $\boldsymbol{\mathcal{R}}_{i}^{\mathcal{F}_{k}}∈\mathbb{R}^{d_{2}}$ are spliced as:
$$
\boldsymbol{\mathcal{R}}_{i}=\boldsymbol{\mathcal{R}}_{i}^{\mathcal{G}}\circ\boldsymbol{\mathcal{R}}_{i}^{\mathcal{F}_{k}}. \tag{3}
$$
Then, the spliced representation is mapped into a fused representation by the lightweight representation projector $\mathcal{P}_{k}(\varphi_{k}^{t-1})$ as:
$$
\widetilde{\boldsymbol{\mathcal{R}}}_{i}=\mathcal{P}_{k}(\boldsymbol{\mathcal{R}}_{i};\varphi_{k}^{t-1}), \tag{4}
$$
where the projector can be a one-layer linear model or a multi-layer perceptron. The fused representation ${\widetilde{\boldsymbol{\mathcal{R}}}}_{i}$ contains both generalized and personalized feature information. It has the same dimension $\mathbb{R}^{d_{2}}$ as the client’s local heterogeneous model representation, which ensures that it matches the parameter dimension $\mathbb{R}^{d_{2}\times L}$ ($L$ is the label dimension) of the client’s local heterogeneous model header.
The representation projector can be updated as the two models are being trained on local non-IID data. Hence, it achieves personalized representation fusion adaptive to local data distributions. Splicing the representations extracted by two feature extractors can keep the relative semantic space positions of the generalized and personalized representations, benefiting the construction of multi-granularity Matryoshka Representations. Owing to representation splicing, the representation dimensions of the two feature extractors can be different (i.e., $d_{1}≤ d_{2}$ ). Therefore, we can vary the representation dimension of the small homogeneous global model to improve the trade-off among model performance, storage requirement and communication costs.
In addition, each client’s local model is treated as a black box by the FL server. When the server broadcasts the global homogeneous small model to the clients, each client can adjust the linear layer dimension of the representation projector to align it with the dimension of the spliced representation. In this way, different clients may hold different representation projectors. When a new model-agnostic client joins in FedMRL, it can adjust its representation projector structure for local model training. Therefore, FedMRL can accommodate FL clients owning local models with diverse structures.
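Equations (2)-(4) amount to a concatenate-then-project step. A minimal NumPy sketch, with random arrays standing in for the two extractors' outputs and a hypothetical one-layer linear projector (the dimensions $d_1=32$, $d_2=64$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 32, 64                    # d1 <= d2: global vs. local representation dims

R_g = rng.normal(size=d1)          # generalized rep from the homogeneous extractor (Eq. 2)
R_f = rng.normal(size=d2)          # personalized rep from the heterogeneous extractor (Eq. 2)

R = np.concatenate([R_g, R_f])     # splice (Eq. 3): dimension d1 + d2

# One-layer linear projector P_k (Eq. 4): maps the spliced representation
# back to d2 so it matches the local header's input dimension.
W_proj = rng.normal(size=(d2, d1 + d2)) * 0.1
R_fused = W_proj @ R
```

Because the splice keeps the two representations in fixed coordinate blocks, the projector's weight matrix can learn a data-dependent mixing of generalized and personalized features, and $d_1$ can be shrunk independently of $d_2$.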
3.2 Multi-Granular Representation Learning
To construct multi-dimensional and multi-granular Matryoshka Representations, we further extract a low-dimension coarse-granularity representation ${\widetilde{\boldsymbol{\mathcal{R}}}}_{i}^{lc}$ and a high-dimension fine-granularity representation ${\widetilde{\boldsymbol{\mathcal{R}}}}_{i}^{hf}$ from the fused representation ${\widetilde{\boldsymbol{\mathcal{R}}}}_{i}$ . They align with the representation dimensions $\{\mathbb{R}^{d_{1}},\mathbb{R}^{d_{2}}\}$ of two feature extractors for matching the parameter dimensions $\{\mathbb{R}^{d_{1}× L},\mathbb{R}^{d_{2}× L}\}$ of the two prediction headers,
$$
\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{lc}=\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{\,1:d_{1}},\quad\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{hf}=\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{\,1:d_{2}}. \tag{5}
$$
The embedded low-dimension coarse-granularity representation ${\widetilde{\boldsymbol{\mathcal{R}}}}_{i}^{lc}∈\mathbb{R}^{d_{1}}$ incorporates coarse generalized and personalized feature information. It is learned by the global homogeneous model header $\mathcal{G}^{hd}(\theta^{hd,t-1})$ (parameter space: $\mathbb{R}^{d_{1}× L}$ ) with generalized prediction information to produce:
$$
\hat{y}_{i}^{\mathcal{G}}=\mathcal{G}^{hd}(\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{lc};\theta^{hd,t-1}). \tag{6}
$$
The embedded high-dimension fine-granularity representation ${\widetilde{\boldsymbol{\mathcal{R}}}}_{i}^{hf}∈\mathbb{R}^{d_{2}}$ carries finer generalized and personalized feature information, which is further processed by the heterogeneous local model header $\mathcal{F}_{k}^{hd}(\omega_{k}^{hd,t-1})$ (parameter space: $\mathbb{R}^{d_{2}× L}$ ) with personalized prediction information to generate:
$$
\hat{y}_{i}^{\mathcal{F}_{k}}=\mathcal{F}_{k}^{hd}(\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{hf};\omega_{k}^{hd,t-1}). \tag{7}
$$
We compute the losses $\ell$ (e.g., cross-entropy loss [62]) between the two outputs and the label $y_{i}$ as:
$$
\ell_{i}^{\mathcal{G}}=\ell(\hat{y}_{i}^{\mathcal{G}},y_{i}),\quad\ell_{i}^{\mathcal{F}_{k}}=\ell(\hat{y}_{i}^{\mathcal{F}_{k}},y_{i}). \tag{8}
$$
Then, the losses of the two branches are weighted by their importance $m_{i}^{\mathcal{G}}$ and $m_{i}^{\mathcal{F}_{k}}$ and summed as:
$$
\ell_{i}=m_{i}^{\mathcal{G}}\cdot\ell_{i}^{\mathcal{G}}+m_{i}^{\mathcal{F}_{k}}\cdot\ell_{i}^{\mathcal{F}_{k}}. \tag{9}
$$
We set $m_{i}^{\mathcal{G}}=m_{i}^{\mathcal{F}_{k}}=1$ by default to make the two models contribute equally to model performance. The complete loss $\ell_{i}$ is used to simultaneously update the homogeneous global small model, the heterogeneous client local model, and the representation projector via gradient descent:
$$
\theta_{k}^{t}\leftarrow\theta^{t-1}-\eta_{\theta}\nabla\ell_{i},\quad\omega_{k}^{t}\leftarrow\omega_{k}^{t-1}-\eta_{\omega}\nabla\ell_{i},\quad\varphi_{k}^{t}\leftarrow\varphi_{k}^{t-1}-\eta_{\varphi}\nabla\ell_{i}, \tag{10}
$$
where $\eta_{\theta}$, $\eta_{\omega}$, and $\eta_{\varphi}$ are the learning rates of the homogeneous global small model, the heterogeneous local model, and the representation projector, respectively. We set $\eta_{\theta}=\eta_{\omega}=\eta_{\varphi}$ by default to ensure stable model convergence. In this way, the generalized and personalized fused representation is learned from multiple perspectives, thereby improving model learning capability.
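Equations (5)-(9) can be sketched end-to-end as follows. This is an illustrative NumPy mock-up with random header weights (not the paper's trained models), using the default importance weights $m_i^{\mathcal{G}}=m_i^{\mathcal{F}_k}=1$ and arbitrary dimensions:

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable cross-entropy for a single sample (Eq. 8).
    z = logits - logits.max()
    return -(z - np.log(np.exp(z).sum()))[label]

rng = np.random.default_rng(1)
d1, d2, L = 32, 64, 10
R_fused = rng.normal(size=d2)            # fused representation from Sec. 3.1

R_lc = R_fused[:d1]                      # low-dim coarse-granularity rep (Eq. 5)
R_hf = R_fused[:d2]                      # high-dim fine-granularity rep (Eq. 5)

G_hd = rng.normal(size=(L, d1)) * 0.1    # global homogeneous header, R^{d1 x L}
F_hd = rng.normal(size=(L, d2)) * 0.1    # local heterogeneous header, R^{d2 x L}

y_g = G_hd @ R_lc                        # generalized prediction (Eq. 6)
y_f = F_hd @ R_hf                        # personalized prediction (Eq. 7)

y_true = 3
m_g = m_f = 1.0                          # default importance weights
loss = m_g * cross_entropy(y_g, y_true) + m_f * cross_entropy(y_f, y_true)  # Eq. 9
```

In the actual method, `loss` is then backpropagated jointly through both models and the projector as in Eq. (10).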
4 Convergence Analysis
Based on the notations, assumptions, and proofs in Appendix B, we analyze the convergence of FedMRL.
**Lemma 1** (Local Training). *Given Assumptions 1 and 2, the loss of an arbitrary client’s local model $w$ in local training round $(t+1)$ is bounded by:*
$$
\mathbb{E}[\mathcal{L}_{(t+1)E}]\leq\mathcal{L}_{tE+0}+\left(\frac{L_{1}\eta^{2}}{2}-\eta\right)\sum_{e=0}^{E}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}. \tag{11}
$$
**Lemma 2** (Model Aggregation). *Given Assumptions 2 and 3, after local training round $(t+1)$, a client’s loss before and after receiving the updated global homogeneous small model is bounded by:*
$$
\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\leq\mathbb{E}[\mathcal{L}_{(t+1)E}]+\eta\delta^{2}. \tag{12}
$$
**Theorem 1** (One Complete Round of FL). *Given the above lemmas, for any client, after receiving the updated global homogeneous small model, we have:*
$$
\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\leq\mathcal{L}_{tE+0}+\left(\frac{L_{1}\eta^{2}}{2}-\eta\right)\sum_{e=0}^{E}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}. \tag{13}
$$
**Theorem 2** (Non-convex Convergence Rate of FedMRL). *Given Theorem 1, for any client and an arbitrary constant $\epsilon>0$, the following holds:*
$$
\frac{1}{T}\sum_{t=0}^{T-1}\sum_{e=0}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}\leq\frac{\frac{1}{T}\sum_{t=0}^{T-1}\left[\mathcal{L}_{tE+0}-\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\right]+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}}{\eta-\frac{L_{1}\eta^{2}}{2}}<\epsilon,\quad\text{s.t.}\quad\eta<\frac{2(\epsilon-\delta^{2})}{L_{1}(\epsilon+E\sigma^{2})}. \tag{14}
$$
Therefore, we conclude that any client’s local model can converge at a non-convex rate of $\epsilon\sim\mathcal{O}(1/T)$ in FedMRL if the learning rates of the homogeneous small model, the client local heterogeneous model and the personalized representation projector satisfy the above conditions.
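For intuition, Theorem 2 follows from Theorem 1 by a standard telescoping argument (the full proofs are in Appendix B); a sketch:

```latex
% Summing Eq. (13) over t = 0, ..., T-1, the expected losses telescope.
% Rearranging, with \eta - L_1\eta^2/2 > 0 (the nonnegative e = E term is
% dropped, which only decreases the left-hand side):
\left(\eta-\frac{L_{1}\eta^{2}}{2}\right)\sum_{t=0}^{T-1}\sum_{e=0}^{E-1}
  \|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}
\;\leq\;
\sum_{t=0}^{T-1}\Bigl[\mathcal{L}_{tE+0}-\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\Bigr]
+T\left(\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}\right).
% Dividing by T(\eta - L_1\eta^2/2) gives Eq. (14); requiring the right-hand
% side to be < \epsilon yields the stated condition on \eta.
```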
5 Experimental Evaluation
We implement FedMRL in PyTorch and compare it with seven state-of-the-art MHeteroFL methods. The experiments are carried out on two benchmark supervised image classification datasets using $4$ NVIDIA GeForce RTX 3090 GPUs (24 GB memory each). Code is available in the supplemental materials.
5.1 Experiment Setup
Datasets. The benchmark datasets adopted are CIFAR-10 and CIFAR-100 [20] (https://www.cs.toronto.edu/%7Ekriz/cifar.html), which are commonly used in FL image classification tasks for evaluating existing MHeteroFL algorithms. CIFAR-10 has $60,000$ $32× 32$ colour images across $10$ classes, with $50,000$ for training and $10,000$ for testing. CIFAR-100 has $60,000$ $32× 32$ colour images across $100$ classes, with $50,000$ for training and $10,000$ for testing. We follow [37] and [34] to construct two types of non-IID datasets. Each client’s non-IID data are further divided into a training set and a testing set with a ratio of $8:2$.
- Non-IID (Class): For CIFAR-10 with $10$ classes, we randomly assign $2$ classes to each FL client. For CIFAR-100 with $100$ classes, we randomly assign $10$ classes to each FL client. The fewer classes each client possesses, the higher the non-IIDness.
- Non-IID (Dirichlet): To produce more sophisticated non-IID settings, for each class of CIFAR-10/CIFAR-100, we draw proportions from a Dirichlet( $\alpha$ ) distribution to determine how much of that class each FL client receives. A smaller $\alpha$ indicates more pronounced non-IIDness.
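The Dirichlet-based split can be sketched as follows (a minimal illustration rather than the authors’ code; `dirichlet_partition` and its arguments are our own names):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices among clients, drawing per-class client
    proportions from Dirichlet(alpha); smaller alpha -> more skew."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Proportion of this class assigned to each client.
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    # 8:2 train/test split of each client's local data, as in the paper.
    splits = []
    for idx in client_indices:
        rng.shuffle(idx)
        cut = int(0.8 * len(idx))
        splits.append((idx[:cut], idx[cut:]))
    return splits
```

Every sample is assigned to exactly one client, and lowering `alpha` concentrates each class on fewer clients.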
Models. We evaluate MHeteroFL algorithms under both model-homogeneous and model-heterogeneous FL scenarios. FedMRL ’s representation projector is a one-layer linear model (parameter space: $\mathbb{R}^{d_{2}\times(d_{1}+d_{2})}$ ).
- Model-Homogeneous FL: All clients train CNN-1 in Table 2 (Appendix C.1). The homogeneous global small models in FML and FedKD are also CNN-1. The extra homogeneous global small model in FedMRL is CNN-1 with a representation dimension $d_{1}$ (i.e., the dimension of the penultimate linear layer) no larger than the local CNN-1 model’s representation dimension $d_{2}$ , i.e., $d_{1}\leq d_{2}$ .
- Model-Heterogeneous FL: The $5$ heterogeneous models {CNN-1, $...$ , CNN-5} in Table 2 (Appendix C.1) are evenly distributed among FL clients. The homogeneous global small model in FML and FedKD is the smallest model, CNN-5. The homogeneous global small model in FedMRL is CNN-5 with a representation dimension $d_{1}$ reduced relative to the CNN-5 model’s representation dimension $d_{2}$ , i.e., $d_{1}\leq d_{2}$ .
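The projector’s shape bookkeeping can be illustrated with a minimal numpy sketch (our own illustrative code, assuming the generalized representation from the small model has dimension $d_1$ and the personalized one has dimension $d_2$; the weight matrix matches the stated parameter space $\mathbb{R}^{d_{2}\times(d_{1}+d_{2})}$):

```python
import numpy as np

d1, d2 = 100, 500  # small-model and local-model representation dims (example values)

# One-layer linear projector: weights in R^{d2 x (d1 + d2)}, bias in R^{d2}.
rng = np.random.default_rng(0)
W = rng.standard_normal((d2, d1 + d2)) * 0.01
b = np.zeros(d2)

def project(r_general, r_personal):
    """Fuse the two representations by concatenation + one linear map."""
    fused_input = np.concatenate([r_general, r_personal])  # shape (d1 + d2,)
    return W @ fused_input + b                             # shape (d2,)

fused = project(rng.standard_normal(d1), rng.standard_normal(d2))
assert fused.shape == (d2,)
# Parameter count of the projector: d2 * (d1 + d2) + d2.
print(W.size + b.size)  # 300500
```

Because the projector is a single linear layer, its overhead is tiny compared with the client models it fuses.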
Comparison Baselines. We compare FedMRL with the Standalone approach and state-of-the-art algorithms from the following three categories of MHeteroFL methods:
- Standalone: Each client trains its heterogeneous local model only with its local data.
- Knowledge Distillation Without Public Data: FD [19] and FedProto [41].
- Model Split: LG-FedAvg [24].
- Mutual Learning: FML [38], FedKD [43] and FedAPEN [34].
Evaluation Metrics. We evaluate MHeteroFL algorithms from the following three aspects:
- Model Accuracy. We record the test accuracy of each client’s model in each round and compute the average test accuracy across clients.
- Communication Cost. We compute the number of parameters transmitted between the server and one client in one communication round, and record the number of rounds required to reach the target average accuracy. The overall communication cost for one client to reach the target average accuracy is the product of the cost per round and the number of rounds.
- Computation Overhead. We compute the FLOPs consumed by one client in one communication round, and record the number of rounds required to reach the target average accuracy. The overall computation overhead for one client to reach the target average accuracy is the product of the FLOPs per round and the number of rounds.
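The two overall-cost metrics reduce to simple products; a quick sketch with hypothetical numbers (`params_per_round`, `flops_per_round`, and `rounds_to_target` are our own illustrative names):

```python
def overall_comm_cost(params_per_round: int, rounds_to_target: int) -> int:
    """Total parameters exchanged by one client before reaching target accuracy."""
    return params_per_round * rounds_to_target

def overall_compute_cost(flops_per_round: float, rounds_to_target: int) -> float:
    """Total FLOPs spent by one client before reaching target accuracy."""
    return flops_per_round * rounds_to_target

# E.g., exchanging a small homogeneous model of 3.0e5 parameters for 180 rounds:
print(overall_comm_cost(300_000, 180))  # 54000000
```

Methods that converge in fewer rounds therefore also cut the overall costs, even at identical per-round cost.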
Training Strategy. We search for optimal common FL hyperparameters and method-specific hyperparameters for all MHeteroFL algorithms. For the FL hyperparameters, we test batch sizes in $\{64,128,256,512\}$ , local epochs in $\{1,10\}$ , $T=\{100,500\}$ communication rounds, and an SGD optimizer with a $0.01$ learning rate. The unique hyperparameter of FedMRL is the representation dimension $d_{1}$ of the homogeneous global small model; we vary $d_{1}\in\{100,150,\ldots,500\}$ to obtain the best-performing FedMRL.
5.2 Results and Discussion
We design three FL settings with different numbers of clients ( $N$ ) and client participation rates ( $C$ ): ( $N=10,C=100\%$ ), ( $N=50,C=20\%$ ), ( $N=100,C=10\%$ ) for both model-homogeneous and model-heterogeneous FL scenarios.
5.2.1 Average Test Accuracy
Table 1 and Table 3 (Appendix C.2) show that FedMRL consistently outperforms all baselines under both model-heterogeneous and model-homogeneous settings. It achieves up to an $8.48\%$ improvement in average test accuracy over the best baseline under each setting, and up to a $24.94\%$ improvement over the best same-category (i.e., mutual learning-based MHeteroFL) baseline. These results demonstrate the superiority of FedMRL in model performance, owing to its adaptive personalized representation fusion and multi-granularity representation learning. Figure 3 (left six) shows that FedMRL consistently achieves faster convergence and higher average test accuracy than the best baseline under each setting.
5.2.2 Individual Client Test Accuracy
Figure 3 (right two) shows the per-client difference between the test accuracy achieved by FedMRL and that of the best-performing baseline FedProto (i.e., FedMRL − FedProto) under ( $N=100,C=10\%$ ). We observe that $87\%$ and $99\%$ of all clients achieve better performance under FedMRL than under FedProto on CIFAR-10 and CIFAR-100, respectively. This demonstrates that FedMRL possesses stronger personalization capability than FedProto, owing to its adaptive personalized multi-granularity representation learning design.
Table 1: Average test accuracy (%) in model-heterogeneous FL.
| Method | N=10, C=100% (CIFAR-10 / CIFAR-100) | N=50, C=20% (CIFAR-10 / CIFAR-100) | N=100, C=10% (CIFAR-10 / CIFAR-100) |
| --- | --- | --- | --- |
| Standalone | 96.53 / 72.53 | 95.14 / 62.71 | 91.97 / 53.04 |
| LG-FedAvg [24] | 96.30 / 72.20 | 94.83 / 60.95 | 91.27 / 45.83 |
| FD [19] | 96.21 / - | - / - | - / - |
| FedProto [41] | 96.51 / 72.59 | 95.48 / 62.69 | 92.49 / 53.67 |
| FML [38] | 30.48 / 16.84 | - / 21.96 | - / 15.21 |
| FedKD [43] | 80.20 / 53.23 | 77.37 / 44.27 | 73.21 / 37.21 |
| FedAPEN [34] | - / - | - / - | - / - |
| **FedMRL** | **96.63** / **74.37** | **95.70** / **66.04** | **95.85** / **62.15** |
| FedMRL - Best B. | 0.10 / 1.78 | 0.22 / 3.33 | 3.36 / <u>8.48</u> |
| FedMRL - Best S.C.B. | 16.43 / 21.14 | 18.33 / 21.77 | 22.64 / <u>24.94</u> |
“-”: failed to converge. **Bold**: the best MHeteroFL method. “Best B.”: the best baseline. “Best S.C.B.”: the best same-category (mutual learning-based MHeteroFL) baseline. The underlined values denote the largest accuracy improvement of FedMRL across the $6$ settings.
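The two improvement rows are simple differences against the rows above them; a quick sketch recomputing them for one column (values transcribed from Table 1, ( $N=100,C=10\%$ ) on CIFAR-100):

```python
# Average test accuracy (%) of the baselines that converged in this setting.
baselines = {"Standalone": 53.04, "LG-FedAvg": 45.83, "FedProto": 53.67,
             "FML": 15.21, "FedKD": 37.21}
# Mutual learning-based subset (FedAPEN failed to converge).
mutual_learning = {"FML": 15.21, "FedKD": 37.21}
fedmrl = 62.15

best_baseline_gap = round(fedmrl - max(baselines.values()), 2)
best_same_category_gap = round(fedmrl - max(mutual_learning.values()), 2)
print(best_baseline_gap, best_same_category_gap)  # 8.48 24.94
```

These match the headline improvements of $8.48\%$ over the best baseline (FedProto) and $24.94\%$ over the best mutual learning-based baseline (FedKD).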
[Figure 3 panels: test accuracy vs. communication round for $N\in\{10,50,100\}$ on CIFAR-10 and CIFAR-100, each comparing FedMRL against the best baseline (Standalone or FedProto); plus two per-client accuracy-difference scatter plots under $N=100$ , where $87\%$ (CIFAR-10) and $99\%$ (CIFAR-100) of clients favour FedMRL.]
Figure 3: Left six: average test accuracy vs. communication rounds. Right two: individual clients’ test accuracy (%) differences (FedMRL - FedProto).
<details>
<summary>x12.png Details</summary>

### Visual Description
## Bar Chart: Communication Rounds for Federated Learning Algorithms
### Overview
This is a bar chart comparing the number of communication rounds required by two federated learning algorithms, FedProto and FedMRL, on two datasets: CIFAR-10 and CIFAR-100. The chart visually represents the performance of each algorithm in terms of convergence speed, as measured by the number of communication rounds needed.
### Components/Axes
* **X-axis:** Datasets - CIFAR-10 and CIFAR-100.
* **Y-axis:** Communication Rounds - Scale ranges from 0 to 350, with increments of 50.
* **Legend:**
* FedProto (represented by a light yellow color)
* FedMRL (represented by a dark blue color)
* **Chart Title:** Not explicitly present.
### Detailed Analysis
The chart consists of four bars, two for each dataset.
**CIFAR-10:**
* **FedProto:** The bar for FedProto on CIFAR-10 is approximately 350 communication rounds. The bar extends nearly to the top of the chart's y-axis.
* **FedMRL:** The bar for FedMRL on CIFAR-10 is approximately 180 communication rounds. It is significantly shorter than the FedProto bar.
**CIFAR-100:**
* **FedProto:** The bar for FedProto on CIFAR-100 is approximately 250 communication rounds.
* **FedMRL:** The bar for FedMRL on CIFAR-100 is approximately 130 communication rounds.
### Key Observations
* FedProto consistently requires more communication rounds than FedMRL for both datasets.
* The difference in communication rounds is more pronounced on the CIFAR-10 dataset.
* Both algorithms require fewer communication rounds on CIFAR-100 compared to CIFAR-10.
### Interpretation
The data suggests that FedMRL converges faster than FedProto on both CIFAR-10 and CIFAR-100, as indicated by the lower number of communication rounds required to reach the respective accuracy targets. Note that the two datasets use different targets (90% on CIFAR-10 vs. 50% on CIFAR-100), which explains why both algorithms need fewer rounds on the harder 100-class dataset. The larger gap on CIFAR-10 indicates that FedMRL's convergence advantage is especially pronounced there. This chart demonstrates the convergence efficiency of FedMRL in federated learning scenarios, and implies that FedMRL may be a preferable choice when minimizing the number of communication rounds is a priority.
</details>
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: Communication Parameters Comparison
### Overview
This is a bar chart comparing the number of communication parameters for two methods, FedProto and FedMRL, across two datasets, CIFAR-10 and CIFAR-100. The chart uses grouped bars to represent the parameter counts for each method within each dataset.
### Components/Axes
* **X-axis:** Datasets - CIFAR-10 and CIFAR-100.
* **Y-axis:** Number of Communication Parameters (Num. of Comm. Paras.), scaled from 0 to 1e8 (100,000,000).
* **Legend:**
* FedProto (represented by a light yellow color)
* FedMRL (represented by a dark blue color)
### Detailed Analysis
The chart consists of two groups of bars, one for CIFAR-10 and one for CIFAR-100.
**CIFAR-10:**
* **FedProto:** The bar is very short, approximately 0.02e8 (2,000,000).
* **FedMRL:** The bar is significantly taller, approximately 0.78e8 (78,000,000).
**CIFAR-100:**
* **FedProto:** The bar is very short, approximately 0.02e8 (2,000,000).
* **FedMRL:** The bar is significantly taller, approximately 0.95e8 (95,000,000).
The bars are positioned side-by-side within each dataset group, allowing for direct comparison between FedProto and FedMRL.
### Key Observations
* FedMRL consistently requires a substantially larger number of communication parameters than FedProto for both datasets.
* The number of communication parameters for FedMRL is of the same order of magnitude for CIFAR-10 and CIFAR-100, while FedProto remains consistently low on both.
* The difference in communication parameters between the two methods is much more pronounced than the difference between the datasets.
### Interpretation
The data suggests that FedMRL, while potentially offering other benefits, is significantly more communication-intensive than FedProto. This could be a critical consideration in scenarios where communication bandwidth is limited or expensive. The relatively stable parameter count for FedMRL across both datasets indicates that the increased communication overhead is inherent to the method itself, rather than being dataset-specific. The consistently low parameter count for FedProto suggests it is a more efficient method in terms of communication costs. This chart highlights a trade-off between communication efficiency and potentially other performance metrics (which are not shown in this chart). Further investigation would be needed to determine if the increased communication cost of FedMRL is justified by improvements in accuracy, convergence speed, or other relevant factors.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: Computation FLOPs Comparison
### Overview
This bar chart compares the computational cost in FLOPs (floating-point operations) required by two methods, FedProto and FedMRL, on two datasets: CIFAR-10 and CIFAR-100. The chart uses paired bars for each dataset to represent the FLOPs for each method.
### Components/Axes
* **X-axis:** Datasets - CIFAR-10 and CIFAR-100.
* **Y-axis:** Computation FLOPs, scaled from 0 to 1e9 (1 billion).
* **Legend:**
* FedProto (represented by a light yellow color)
* FedMRL (represented by a dark blue color)
* **Chart Title:** Not explicitly present, but the chart's purpose is clear from the axes and legend.
### Detailed Analysis
The chart consists of two pairs of bars, one for each dataset.
**CIFAR-10:**
* **FedProto:** The bar for FedProto on CIFAR-10 reaches approximately 4.8e9 FLOPs. The bar is light yellow, matching the legend.
* **FedMRL:** The bar for FedMRL on CIFAR-10 reaches approximately 2.5e9 FLOPs. The bar is dark blue, matching the legend.
**CIFAR-100:**
* **FedProto:** The bar for FedProto on CIFAR-100 reaches approximately 3.5e9 FLOPs. The bar is light yellow, matching the legend.
* **FedMRL:** The bar for FedMRL on CIFAR-100 reaches approximately 1.7e9 FLOPs. The bar is dark blue, matching the legend.
### Key Observations
* FedProto consistently requires more FLOPs than FedMRL for both datasets.
* The difference in FLOPs between the two methods is more pronounced on the CIFAR-10 dataset than on the CIFAR-100 dataset.
* The FLOPs required for both methods are higher on CIFAR-10 than on CIFAR-100.
### Interpretation
The data suggests that FedProto incurs a higher total computation cost than FedMRL, despite FedMRL's higher per-round cost, because FedMRL reaches the accuracy target in fewer rounds. The higher totals on CIFAR-10 likely reflect the stricter 90% accuracy target used for that dataset (vs. 50% for CIFAR-100), which requires more rounds. The chart highlights a trade-off between total computation and other factors such as model accuracy or convergence speed, which are not directly represented here. If computational resources are limited, FedMRL appears to be the more suitable choice; further investigation would be needed to confirm the reasons behind these differences for a given application.
</details>
Figure 4: Communication rounds, number of communicated parameters, and computation FLOPs required to reach $90\%$ and $50\%$ average test accuracy targets on CIFAR-10 and CIFAR-100.
5.2.3 Communication Cost
We record the number of communication rounds and the number of parameters sent per client to reach the $90\%$ and $50\%$ average test accuracy targets on CIFAR-10 and CIFAR-100, respectively. Figure 4 (left) shows that FedMRL requires fewer rounds than FedProto, i.e., it converges faster. Figure 4 (middle) shows that FedMRL incurs higher communication costs than FedProto because it transmits the full homogeneous small model, whereas FedProto only transmits each client's seen-class average representations between the server and the client. Nevertheless, by choosing a smaller representation dimension ($d_{1}$) for the homogeneous small model, FedMRL still achieves higher communication efficiency than the same-category mutual learning-based MHeteroFL baselines (FML, FedKD, FedAPEN), which use larger representation dimensions.
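The communication-cost gap described above can be sketched with simple parameter counting. The model sizes below are illustrative assumptions, not the paper's measured values:

```python
# Illustrative per-round upload cost, assuming FedMRL transmits the full
# homogeneous small model while FedProto transmits one averaged representation
# (prototype) per locally seen class. Sizes below are made-up examples.
def fedmrl_comm_params(small_model_params: int) -> int:
    """Parameters uploaded per client per round under FedMRL."""
    return small_model_params

def fedproto_comm_params(num_seen_classes: int, rep_dim: int) -> int:
    """Parameters uploaded per client per round under FedProto:
    one rep_dim-dimensional prototype per seen class."""
    return num_seen_classes * rep_dim

small_model = 500_000  # assumed parameter count of the homogeneous small model
assert fedmrl_comm_params(small_model) > fedproto_comm_params(10, 512)
```

This also makes explicit why shrinking $d_1$ helps: a smaller homogeneous model directly reduces the per-round upload.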
5.2.4 Computation Overhead
We also calculate the computation FLOPs consumed per client to reach the $90\%$ and $50\%$ average test accuracy targets on CIFAR-10 and CIFAR-100, respectively. Figure 4 (right) shows that FedMRL incurs lower total computation costs than FedProto. Although training an additional homogeneous small model and a linear representation projector raises FedMRL's per-round computation, its faster convergence (i.e., fewer rounds) more than compensates.
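The rounds-times-per-round-cost trade-off can be made concrete with a toy calculation; all numbers below are hypothetical, chosen only to mirror the qualitative pattern in Figure 4:

```python
# Toy arithmetic for the convergence/computation trade-off: even with higher
# per-round FLOPs, fewer rounds to reach the target can lower the total cost.
# All numbers are hypothetical.
def total_flops(rounds: int, flops_per_round: float) -> float:
    """Total computation = number of rounds x per-round cost."""
    return rounds * flops_per_round

baseline_total = total_flops(rounds=350, flops_per_round=1.4e7)
# Assume 1.5x the per-round cost (extra small model + projector), fewer rounds:
fedmrl_total = total_flops(rounds=180, flops_per_round=1.5 * 1.4e7)
assert fedmrl_total < baseline_total
```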
5.3 Case Studies
5.3.1 Robustness to Non-IIDness (Class)
We evaluate the robustness of FedMRL to different degrees of non-IIDness, controlled by the number of classes assigned to each client under the ($N=100, C=10\%$) setting. The fewer classes assigned to each client, the higher the non-IIDness. For CIFAR-10, we assign $\{2,4,...,10\}$ of the $10$ classes to each client. For CIFAR-100, we assign $\{10,30,...,100\}$ of the $100$ classes to each client. Figure 5 (left two) shows that FedMRL consistently achieves higher average test accuracy than the best-performing baseline, FedProto, on both datasets, demonstrating its robustness to class-based non-IIDness.
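A minimal sketch of this class-based partitioning scheme, assuming each client simply samples a fixed number of class labels (the paper's exact partitioning code may differ):

```python
import random

def assign_classes(num_clients: int, num_classes: int,
                   classes_per_client: int, seed: int = 0) -> list[list[int]]:
    """Give each client a fixed-size random subset of class labels; fewer
    classes per client means higher label non-IIDness."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_classes), classes_per_client))
            for _ in range(num_clients)]

# CIFAR-10-style setting: 100 clients, 2 of 10 classes each (highest non-IIDness).
parts = assign_classes(num_clients=100, num_classes=10, classes_per_client=2)
assert all(len(p) == 2 for p in parts)
```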
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Chart: CIFAR-10 Test Accuracy vs. Number of Classes
### Overview
This line chart displays the test accuracy of two federated learning methods, FedProto and FedMRL, as a function of the number of CIFAR-10 classes assigned to each client. The chart shows how performance degrades as the number of classes per client increases.
### Components/Axes
* **Title:** CIFAR-10 (centered at the top)
* **X-axis:** Number of Classes (ranging from 2 to 10, with markers at 2, 4, 6, 8, and 10)
* **Y-axis:** Test Accuracy (ranging from approximately 40 to 90, with markers at 40, 60, 80)
* **Legend:** Located in the bottom-left corner.
* FedProto (represented by a light blue dashed line with circular markers)
* FedMRL (represented by a purple solid line with star-shaped markers)
### Detailed Analysis
**FedProto (Light Blue, Circles):**
The FedProto line slopes downward.
* At 2 classes: Approximately 85% test accuracy.
* At 4 classes: Approximately 73% test accuracy.
* At 6 classes: Approximately 62% test accuracy.
* At 8 classes: Approximately 52% test accuracy.
* At 10 classes: Approximately 40% test accuracy.
**FedMRL (Purple, Stars):**
The FedMRL line also slopes downward, but less steeply than FedProto.
* At 2 classes: Approximately 88% test accuracy.
* At 4 classes: Approximately 82% test accuracy.
* At 6 classes: Approximately 80% test accuracy.
* At 8 classes: Approximately 77% test accuracy.
* At 10 classes: Approximately 64% test accuracy.
### Key Observations
* Both FedProto and FedMRL exhibit a decrease in test accuracy as the number of classes increases.
* FedMRL consistently outperforms FedProto across all tested numbers of classes.
* The performance drop is more pronounced for FedProto than for FedMRL.
* The difference in performance between the two protocols widens as the number of classes increases.
### Interpretation
The data suggests that both methods' personalized accuracy declines as each client is assigned more classes: with more classes per client, the local task becomes harder, so per-client test accuracy drops even though the overall partition is less non-IID. FedMRL's flatter decline indicates that it is more robust to this increased local task complexity, plausibly owing to its adaptive representation fusion and multi-granularity Matryoshka representation learning. The widening performance gap as the number of classes increases highlights FedMRL's benefit in the more challenging settings. Overall, the chart shows a clear trade-off between the number of classes per client and the achievable personalized test accuracy in this federated setting.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: CIFAR-100 Test Accuracy vs. Number of Classes
### Overview
This line chart depicts the relationship between the number of classes and test accuracy for two different federated learning methods: FedProto and FedMRL, evaluated on the CIFAR-100 dataset. The chart shows how the accuracy of each method degrades as the number of classes increases.
### Components/Axes
* **Title:** CIFAR-100
* **X-axis:** Number of Classes (ranging from 10 to 100, with markers at 10, 30, 50, 70, 90, and 100)
* **Y-axis:** Test Accuracy (ranging from 0 to 60, with markers at 0, 20, 40, and 60)
* **Legend:**
* FedProto (represented by a light blue dashed line with circle markers)
* FedMRL (represented by a purple solid line with star markers)
### Detailed Analysis
**FedProto (Light Blue Dashed Line):**
The FedProto line slopes downward overall.
* At 10 classes, the test accuracy is approximately 52%.
* At 30 classes, the test accuracy is approximately 28%.
* At 50 classes, the test accuracy is approximately 18%.
* At 70 classes, the test accuracy is approximately 14%.
* At 90 classes, the test accuracy is approximately 12%.
* At 100 classes, the test accuracy is approximately 10%.
**FedMRL (Purple Solid Line):**
The FedMRL line also slopes downward overall, but initially starts at a higher accuracy than FedProto.
* At 10 classes, the test accuracy is approximately 60%.
* At 30 classes, the test accuracy is approximately 32%.
* At 50 classes, the test accuracy is approximately 18%.
* At 70 classes, the test accuracy is approximately 16%.
* At 90 classes, the test accuracy is approximately 14%.
* At 100 classes, the test accuracy is approximately 12%.
### Key Observations
* Both FedProto and FedMRL experience a significant drop in test accuracy as the number of classes increases.
* FedMRL consistently outperforms FedProto across all tested numbers of classes.
* The rate of accuracy decline appears to slow down as the number of classes approaches 100 for both methods.
* The initial difference in accuracy between the two methods is substantial, but the gap narrows as the number of classes increases.
### Interpretation
The data suggests that both federated learning methods struggle with increasing class complexity. The CIFAR-100 dataset, with its 100 classes, presents a significant challenge for both FedProto and FedMRL. The superior performance of FedMRL indicates that it is more robust to the increased complexity, potentially due to its underlying mechanisms for handling diverse data distributions. The slowing rate of accuracy decline at higher class numbers might suggest a saturation point where adding more classes yields diminishing returns in terms of accuracy loss. This could be due to the models reaching their capacity to differentiate between the classes or the limitations of the federated learning setup itself. The chart highlights the importance of considering the number of classes when evaluating and deploying federated learning models, and suggests that FedMRL may be a more suitable choice for tasks with a large number of classes.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Chart: CIFAR-10 Test Accuracy vs. Alpha
### Overview
This line chart displays the relationship between the parameter alpha (α) and test accuracy for two different federated learning methods: FedProto and FedMRL, on the CIFAR-10 dataset. The chart shows how the test accuracy of each method changes as the value of alpha varies from 0.1 to 0.5.
### Components/Axes
* **Title:** CIFAR-10
* **X-axis:** α (Alpha) - Scale ranges from 0.1 to 0.5 with markers at 0.1, 0.2, 0.3, 0.4, and 0.5.
* **Y-axis:** Test Accuracy - Scale ranges from 30 to 70 with markers at 40, 50, and 60.
* **Data Series 1:** FedProto - Represented by a dashed light blue line with circular markers.
* **Data Series 2:** FedMRL - Represented by a solid purple line with star-shaped markers.
* **Legend:** Located in the top-left corner, identifying the two data series and their corresponding colors/markers.
### Detailed Analysis
**FedProto (Light Blue, Circles):**
The FedProto line slopes downward overall.
* At α = 0.1, Test Accuracy ≈ 43%.
* At α = 0.2, Test Accuracy ≈ 41%.
* At α = 0.3, Test Accuracy ≈ 40%.
* At α = 0.4, Test Accuracy ≈ 39%.
* At α = 0.5, Test Accuracy ≈ 38%.
**FedMRL (Purple, Stars):**
The FedMRL line shows a slight downward trend, but remains relatively stable.
* At α = 0.1, Test Accuracy ≈ 68%.
* At α = 0.2, Test Accuracy ≈ 66%.
* At α = 0.3, Test Accuracy ≈ 64%.
* At α = 0.4, Test Accuracy ≈ 63%.
* At α = 0.5, Test Accuracy ≈ 62%.
### Key Observations
* FedMRL consistently achieves significantly higher test accuracy than FedProto across all values of alpha.
* The accuracy of FedProto decreases steadily as alpha increases.
* The accuracy of FedMRL decreases slightly as alpha increases, but the change is less pronounced than for FedProto.
* The difference in accuracy between the two methods is most significant at lower values of alpha.
### Interpretation
In this experiment, α is the concentration parameter of the Dirichlet distribution used to partition data across clients: a smaller α yields a more skewed (more non-IID) partition. The chart suggests that FedMRL is more robust than FedProto on CIFAR-10 across the whole range, with the largest margin at low α. Both methods' personalized accuracy decreases as α grows, which is consistent with personalized FL: as local distributions become more uniform, each client's local task covers more classes and becomes harder. FedMRL's flatter curve indicates lower sensitivity to α, and its consistently higher accuracy marks it as the better-performing method for this dataset and parameter range.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Chart: CIFAR-100 Test Accuracy vs. Alpha
### Overview
This chart displays the relationship between test accuracy and the parameter alpha (α) for two different federated learning protocols: FedProto and FedMRL, on the CIFAR-100 dataset. The chart uses lines with markers to represent the data for each protocol.
### Components/Axes
* **Title:** CIFAR-100 (positioned at the top-center)
* **X-axis Label:** α (positioned at the bottom-center)
* **Y-axis Label:** Test Accuracy (positioned at the left-center)
* **Legend:** Located in the top-right corner.
* FedProto (represented by a dashed light-blue line with circular markers)
* FedMRL (represented by a solid purple line with star-shaped markers)
* **X-axis Markers:** 0.1, 0.2, 0.3, 0.4, 0.5
* **Y-axis Scale:** Ranges from approximately 9 to 16.
### Detailed Analysis
**FedProto (Light-Blue Dashed Line with Circles):**
The line slopes downward overall.
* At α = 0.1, Test Accuracy ≈ 12.1
* At α = 0.2, Test Accuracy ≈ 11.2
* At α = 0.3, Test Accuracy ≈ 10.1
* At α = 0.4, Test Accuracy ≈ 9.4
* At α = 0.5, Test Accuracy ≈ 9.6
**FedMRL (Purple Solid Line with Stars):**
The line initially slopes downward, then slightly upward.
* At α = 0.1, Test Accuracy ≈ 15.4
* At α = 0.2, Test Accuracy ≈ 14.3
* At α = 0.3, Test Accuracy ≈ 13.2
* At α = 0.4, Test Accuracy ≈ 12.1
* At α = 0.5, Test Accuracy ≈ 12.2
### Key Observations
* FedMRL consistently achieves higher test accuracy than FedProto across all values of α.
* FedProto's accuracy decreases steadily as α increases.
* FedMRL's accuracy decreases from α = 0.1 to α = 0.4, then shows a slight increase at α = 0.5.
* The difference in accuracy between the two protocols is most significant at lower values of α.
### Interpretation
In this experiment, α is the Dirichlet concentration parameter controlling the degree of non-IIDness: a smaller α means a more skewed partition. The chart suggests that FedMRL is consistently more effective than FedProto on CIFAR-100 across all tested α. Both methods' accuracy generally declines as α increases, consistent with the personalized task becoming harder as local distributions flatten; the slight uptick for both methods at α = 0.5 is small enough to fall within run-to-run noise. The larger margin at low α indicates that FedMRL's advantage is most pronounced under high non-IIDness, suggesting that its personalized representation fusion copes better with heavily skewed local data.
</details>
Figure 5: Robustness to non-IIDness (Class & Dirichlet).
<details>
<summary>x19.png Details</summary>

### Visual Description
\n
## Line Chart: CIFAR-10 Test Accuracy vs. d1
### Overview
This image presents a line chart illustrating the relationship between a parameter 'd1' and the test accuracy achieved using a method called "FedMRL" on the CIFAR-10 dataset. The chart displays a fluctuating trend, showing how test accuracy changes as 'd1' varies.
### Components/Axes
* **Title:** CIFAR-10 (positioned at the top-center)
* **X-axis:** Labeled as 'd1' (positioned at the bottom-center). The scale ranges from approximately 0 to 500, with markers at 0, 100, 200, 300, 400, and 500.
* **Y-axis:** Labeled as 'Test Accuracy' (positioned at the left-center). The scale ranges from approximately 93 to 96, with gridlines at 93, 94, and 95.
* **Legend:** Located in the top-right corner. It contains:
* Label: "FedMRL"
* Line Style: Dashed line
* Marker: Star symbol
* Color: Purple
### Detailed Analysis
The chart shows a single data series representing "FedMRL". The line is composed of star-shaped markers connected by dashed purple lines. The trend fluctuates rather than moving monotonically.
Here's a breakdown of approximate data points, reading from left to right:
* d1 = 0: Test Accuracy ≈ 95.2
* d1 = 100: Test Accuracy ≈ 93.2
* d1 = 200: Test Accuracy ≈ 95.8
* d1 = 300: Test Accuracy ≈ 94.5
* d1 = 400: Test Accuracy ≈ 93.5
* d1 = 500: Test Accuracy ≈ 94.3
The line initially drops sharply from d1=0 to d1=100, then rises significantly to d1=200. It then decreases again to d1=400, and finally shows a slight increase to d1=500.
### Key Observations
* The lowest test accuracy is observed around d1 = 100 and d1 = 400.
* The highest test accuracy is observed around d1 = 200.
* The data exhibits significant fluctuations, suggesting a sensitive relationship between 'd1' and test accuracy.
### Interpretation
The chart suggests that $d_1$ has a non-monotonic effect on FedMRL's test accuracy on CIFAR-10: increasing $d_1$ neither consistently improves nor degrades performance, and accuracy fluctuates within roughly a 2-point band.
In FedMRL, $d_1$ is the representation dimension of the shared homogeneous small model, so it governs both the granularity of the fused representation and the per-round communication and computation cost. Since accuracy does not improve with larger $d_1$, a small $d_1$ offers the best trade-off: comparable accuracy at lower overhead. The dips around $d_1 = 100$ and $d_1 = 400$ may simply reflect run-to-run variance, and repeated runs would be needed to confirm them.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Chart: CIFAR-100 Test Accuracy vs. d1
### Overview
This image presents a line chart illustrating the relationship between a parameter 'd1' and the test accuracy achieved on the CIFAR-100 dataset, using the FedMRL method. The chart displays a fluctuating trend of test accuracy as 'd1' varies.
### Components/Axes
* **Title:** CIFAR-100 (positioned at the top-center)
* **X-axis:** Labeled as 'd1' (positioned at the bottom-center). The scale ranges from approximately 0 to 500, with markers at 0, 100, 200, 300, 400, and 500.
* **Y-axis:** Labeled as 'Test Accuracy' (positioned on the left-center). The scale ranges from approximately 57 to 62, with markers at 57, 58, 59, 60, 61, and 62.
* **Legend:** Located in the top-right corner. It contains one entry:
* Label: "FedMRL"
* Color: Light purple (#D8BFD8)
* Marker: Star (*)
### Detailed Analysis
The chart displays a single data series representing the FedMRL method. The line is dashed and connects star-shaped data points.
Here's a breakdown of the approximate data points, reading from left to right:
* d1 = 0: Test Accuracy ≈ 61.8%
* d1 = 100: Test Accuracy ≈ 60.5%
* d1 = 200: Test Accuracy ≈ 57.5%
* d1 = 300: Test Accuracy ≈ 58.5%
* d1 = 400: Test Accuracy ≈ 57.2%
* d1 = 500: Test Accuracy ≈ 59.5%
The line initially slopes downward from d1 = 0 to d1 = 200, then exhibits fluctuations with a slight upward trend between d1 = 200 and d1 = 500.
### Key Observations
* The highest test accuracy is achieved at the beginning of the range (d1 = 0).
* The lowest test accuracy is observed around d1 = 200 and d1 = 400.
* The test accuracy appears to recover somewhat towards the end of the range (d1 = 500).
* The fluctuations suggest that the parameter 'd1' has a non-monotonic effect on test accuracy.
### Interpretation
The chart suggests that $d_1$, the representation dimension of the homogeneous small model in FedMRL, has a non-monotonic effect on CIFAR-100 test accuracy: accuracy is highest at the smallest tested $d_1$ and fluctuates within a few points thereafter. This agrees with the sensitivity analysis in the main text, where smaller $d_1$ values yield higher average accuracy while also lowering communication and computation overheads. The fluctuations at larger $d_1$ may reflect training noise or redundancy in the larger shared representation; repeated runs would be needed to pin down the cause.
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Chart: CIFAR-10 Test Accuracy vs. d1
### Overview
This line chart displays the test accuracy of FedMRL on the CIFAR-10 dataset with and without MRL (Matryoshka Representation Learning), the multi-granularity representation component ablated in Section 5.4. The x-axis represents the representation dimension $d_1$, and the y-axis represents the test accuracy. Two lines are plotted, one for the model without MRL and one for the model with MRL.
### Components/Axes
* **Title:** CIFAR-10 (top-center)
* **X-axis Label:** d1 (bottom-center)
* **X-axis Markers:** 100, 200, 300, 400, 500
* **Y-axis Label:** Test Accuracy (left-center)
* **Y-axis Scale:** Ranges from approximately 93 to 96.
* **Legend:** Located in the top-right corner.
* "w/o MRL" - Represented by a blue dashed line with square markers.
* "w/ MRL" - Represented by a purple solid line with star markers.
### Detailed Analysis
**Line 1: w/o MRL (Blue Dashed Line)**
This line shows a relatively flat trend.
* At d1 = 100, Test Accuracy ≈ 93.4
* At d1 = 200, Test Accuracy ≈ 93.6
* At d1 = 300, Test Accuracy ≈ 93.8
* At d1 = 400, Test Accuracy ≈ 93.2
* At d1 = 500, Test Accuracy ≈ 93.7
**Line 2: w/ MRL (Purple Solid Line)**
This line exhibits a more pronounced trend, initially increasing and then decreasing.
* At d1 = 100, Test Accuracy ≈ 95.2
* At d1 = 200, Test Accuracy ≈ 96.1
* At d1 = 300, Test Accuracy ≈ 94.4
* At d1 = 400, Test Accuracy ≈ 93.5
* At d1 = 500, Test Accuracy ≈ 94.6
### Key Observations
* The model *with* MRL generally achieves higher test accuracy than the model *without* MRL, especially at lower values of d1 (100 and 200).
* The accuracy of the model *with* MRL peaks at d1 = 200, then declines at d1 = 300 and d1 = 400, before slightly recovering at d1 = 500.
* The model *without* MRL shows minimal variation in accuracy across the range of d1 values.
### Interpretation
The data suggests that the MRL component improves test accuracy on CIFAR-10, with the largest gains at small $d_1$ (100 and 200). As $d_1$ grows beyond 200, the benefit shrinks, and at $d_1 = 400$ accuracy falls to roughly the level of the model without MRL. This matches the ablation discussion in the main text: as $d_1$ rises, the global and local headers learn increasingly overlapping representation information, so the multi-granularity Matryoshka representations contribute less. The flat curve without MRL suggests that the plain fused representation is largely insensitive to $d_1$ in this range, and the peak at $d_1 = 200$ for the MRL variant marks the setting where the nested representations are most effective.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
\n
## Line Chart: CIFAR-100 Test Accuracy vs. d1
### Overview
This line chart displays the test accuracy of FedMRL on the CIFAR-100 dataset, plotted against the representation dimension $d_1$. Two data series are presented: one with MRL (Matryoshka Representation Learning) and one without. The chart shows the impact of MRL on model performance as $d_1$ varies.
### Components/Axes
* **Title:** CIFAR-100
* **X-axis:** d1, ranging from 100 to 500, with markers at 100, 200, 300, 400, and 500.
* **Y-axis:** Test Accuracy, ranging from approximately 54 to 62, with markers at 55, 56, 57, 58, 59, 60, 61.
* **Legend:** Located in the top-left corner.
* "w/o MRL" (without MRL) - represented by a blue dashed line with square markers.
* "w/ MRL" (with MRL) - represented by a purple solid line with star markers.
### Detailed Analysis
**Data Series 1: w/o MRL (Blue Dashed Line)**
The line slopes downward from d1 = 100 to d1 = 300, then slopes upward from d1 = 300 to d1 = 500.
* At d1 = 100, Test Accuracy ≈ 57.5%.
* At d1 = 200, Test Accuracy ≈ 55.5%.
* At d1 = 300, Test Accuracy ≈ 54%.
* At d1 = 400, Test Accuracy ≈ 56%.
* At d1 = 500, Test Accuracy ≈ 54.5%.
**Data Series 2: w/ MRL (Purple Solid Line)**
The line slopes downward from d1 = 100 to d1 = 300, stays roughly flat from d1 = 300 to d1 = 400, then slopes upward from d1 = 400 to d1 = 500.
* At d1 = 100, Test Accuracy ≈ 59%.
* At d1 = 200, Test Accuracy ≈ 58%.
* At d1 = 300, Test Accuracy ≈ 57%.
* At d1 = 400, Test Accuracy ≈ 57.5%.
* At d1 = 500, Test Accuracy ≈ 59%.
### Key Observations
* The "w/ MRL" line consistently shows higher test accuracy than the "w/o MRL" line across all values of d1.
* The "w/o MRL" line exhibits a more pronounced fluctuation in test accuracy as d1 changes, indicating greater sensitivity to this parameter.
* The "w/ MRL" line demonstrates a more stable performance, with a smaller range of accuracy values.
* The lowest accuracy for the "w/o MRL" line is at d1 = 300.
* The highest accuracy for the "w/ MRL" line is at d1 = 100 and d1 = 500.
### Interpretation
The data suggests that incorporating MRL (Matryoshka Representation Learning) improves the model's test accuracy on CIFAR-100 and makes performance less sensitive to $d_1$. Without MRL, $d_1$ has a non-monotonic effect, with accuracy dipping around $d_1 = 300$ before partially recovering. The consistent improvement with MRL suggests that learning nested multi-granularity representations stabilizes training and reduces dependence on the exact representation dimension. The chart provides evidence that MRL is a beneficial addition for this dataset and parameter range.
</details>
Figure 6: Left two: sensitivity analysis results. Right two: ablation study results.
5.3.2 Robustness to Non-IIDness (Dirichlet)
We also test the robustness of FedMRL to varying degrees of non-IIDness, controlled by the concentration parameter $\alpha$ of the Dirichlet distribution, under the ( $N=100,C=10\%$ ) setting. A smaller $\alpha$ indicates higher non-IIDness. For both datasets, we vary $\alpha$ over $\{0.1,\ldots,0.5\}$. Figure 5 (right two) shows that FedMRL significantly outperforms FedProto under all non-IIDness settings, validating its robustness to Dirichlet non-IIDness.
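The Dirichlet label-skew partition used in this experiment can be sketched as follows. This is a minimal numpy illustration of the standard technique; the function name, toy labels, and seed are ours, not from the paper:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients with per-class proportions drawn
    from Dir(alpha); a smaller alpha yields a more skewed (non-IID) split."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(n_clients))   # class-c split proportions
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return [np.array(ix) for ix in client_idx]

# toy dataset: 10 classes with 100 samples each, split across 5 clients
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, n_clients=5, alpha=0.1)
```

With $\alpha=0.1$ most clients end up dominated by a few classes, while larger $\alpha$ (e.g., $0.5$) moves the split closer to IID.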
5.3.3 Sensitivity Analysis - $d_{1}$
FedMRL relies on one hyperparameter, $d_{1}$, the representation dimension of the homogeneous small model. To evaluate sensitivity to $d_{1}$, we test FedMRL with $d_{1}\in\{100,150,\ldots,500\}$ under the ( $N=100,C=10\%$ ) setting. Figure 6 (left two) shows that smaller $d_{1}$ values yield higher average test accuracy on both datasets. Since a smaller $d_{1}$ also reduces communication and computation overheads, it helps FedMRL achieve the best trade-off among model performance, communication efficiency, and computational efficiency.
5.4 Ablation Study
We conduct ablation experiments to validate the usefulness of MRL. For FedMRL with MRL, the global header and the local header learn multi-granularity representations. For FedMRL without MRL, we directly feed the representation fused by the representation projector into the client’s local header for loss computation (i.e., we do not extract Matryoshka Representations, and we remove the global header). Figure 6 (right two) shows that FedMRL with MRL consistently outperforms FedMRL without MRL, demonstrating the effectiveness of incorporating MRL into MHeteroFL. Moreover, the accuracy gap between them narrows as $d_{1}$ rises: as the global and local headers learn increasingly overlapping representation information, the benefit of MRL diminishes.
6 Conclusions
This paper proposes a novel MHeteroFL approach, FedMRL, to jointly address the data, system, and model heterogeneity challenges in FL. The key design insight is the addition of a global homogeneous small model shared by FL clients for enhanced knowledge interaction among heterogeneous local models. Adaptive personalized representation fusion and multi-granularity Matryoshka Representation learning further boost model learning capability. The clients and the server only exchange the homogeneous small model, while the clients’ heterogeneous local models and data remain unexposed, thereby enhancing the preservation of both model and data privacy. Theoretical analysis shows that FedMRL is guaranteed to converge over time. Extensive experiments demonstrate that FedMRL significantly outperforms state-of-the-art methods in test accuracy, while incurring low communication and computation costs. Appendix D discusses FedMRL’s privacy, communication, and computation. Appendix E elaborates on FedMRL’s broader impact and limitations.
References
- [1] Jin-Hyun Ahn et al. Wireless federated distillation for distributed edge learning with heterogeneous data. In Proc. PIMRC, pages 1–6, Istanbul, Turkey, 2019. IEEE.
- [2] Jin-Hyun Ahn et al. Cooperative learning via federated distillation over fading channels. In Proc. ICASSP, pages 8856–8860, Barcelona, Spain, 2020. IEEE.
- [3] Samiul Alam et al. Fedrolex: Model-heterogeneous federated learning with rolling sub-model extraction. In Proc. NeurIPS, virtual, 2022.
- [4] Sara Babakniya et al. Revisiting sparsity hunting in federated learning: Why does sparsity consensus matter? Transactions on Machine Learning Research, 1(1):1, 2023.
- [5] Yun-Hin Chan, Rui Zhou, Running Zhao, Zhihan Jiang, and Edith C. H. Ngai. Internal cross-layer gradients for extending homogeneity to heterogeneity in federated learning. In Proc. ICLR, page 1, Vienna, Austria, 2024. OpenReview.net.
- [6] Hongyan Chang et al. Cronus: Robust and heterogeneous collaborative learning with black-box knowledge transfer. In Proc. NeurIPS Workshop, virtual, 2021.
- [7] Jiangui Chen et al. Fedmatch: Federated learning over heterogeneous question answering data. In Proc. CIKM, pages 181–190, virtual, 2021. ACM.
- [8] Sijie Cheng et al. Fedgems: Federated learning of larger server models via selective knowledge fusion. CoRR, abs/2110.11027, 2021.
- [9] Yae Jee Cho et al. Heterogeneous ensemble knowledge transfer for training large models in federated learning. In Proc. IJCAI, pages 2881–2887, virtual, 2022. ijcai.org.
- [10] Liam Collins et al. Exploiting shared representations for personalized federated learning. In Proc. ICML, volume 139, pages 2089–2099, virtual, 2021. PMLR.
- [11] Enmao Diao et al. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In Proc. ICLR, page 1, Virtual Event, Austria, 2021. OpenReview.net.
- [12] Xuan Gong et al. Federated learning via input-output collaborative distillation. In Proc. AAAI, pages 22058–22066, Vancouver, Canada, 2024. AAAI Press.
- [13] Chaoyang He et al. Group knowledge transfer: Federated learning of large CNNs at the edge. In Proc. NeurIPS, virtual, 2020.
- [14] S. Horváth. FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout. In Proc. NeurIPS, pages 12876–12889, virtual, 2021. OpenReview.net.
- [15] Wenke Huang et al. Few-shot model agnostic federated learning. In Proc. MM, pages 7309–7316, Lisboa, Portugal, 2022. ACM.
- [16] Wenke Huang et al. Learn from others and be yourself in heterogeneous federated learning. In Proc. CVPR, pages 10133–10143, virtual, 2022. IEEE.
- [17] Sohei Itahara et al. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data. IEEE Trans. Mob. Comput., 22(1):191–205, 2023.
- [18] Jaehee Jang et al. Fedclassavg: Local representation learning for personalized federated learning on heterogeneous neural networks. In Proc. ICPP, pages 76:1–76:10, virtual, 2022. ACM.
- [19] Eunjeong Jeong et al. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. In Proc. NeurIPS Workshop on Machine Learning on the Phone and other Consumer Devices, virtual, 2018.
- [20] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, ON, Canada, 2009.
- [21] Aditya Kusupati et al. Matryoshka representation learning. In Proc. NeurIPS, New Orleans, LA, USA, 2022.
- [22] Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. In Proc. NeurIPS Workshop, virtual, 2019.
- [23] Qinbin Li et al. Practical one-shot federated learning for cross-silo setting. In Proc. IJCAI, pages 1484–1490, virtual, 2021. ijcai.org.
- [24] Paul Pu Liang et al. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 1(1), 2020.
- [25] Tao Lin et al. Ensemble distillation for robust model fusion in federated learning. In Proc. NeurIPS, virtual, 2020.
- [26] Chang Liu et al. Completely heterogeneous federated learning. CoRR, abs/2210.15865, 2022.
- [27] Disha Makhija et al. Architecture agnostic federated learning for neural networks. In Proc. ICML, volume 162, pages 14860–14870, virtual, 2022. PMLR.
- [28] Koji Matsuda et al. Fedme: Federated learning via model exchange. In Proc. SDM, pages 459–467, Alexandria, VA, USA, 2022. SIAM.
- [29] Brendan McMahan et al. Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS, volume 54, pages 1273–1282, Fort Lauderdale, FL, USA, 2017. PMLR.
- [30] Duy Phuong Nguyen et al. Enhancing heterogeneous federated learning with knowledge extraction and multi-model fusion. In Proc. SC Workshop, pages 36–43, Denver, CO, USA, 2023. ACM.
- [31] Jaehoon Oh et al. Fedbabu: Toward enhanced representation for federated image classification. In Proc. ICLR, virtual, 2022. OpenReview.net.
- [32] Sejun Park et al. Towards understanding ensemble distillation in federated learning. In Proc. ICML, volume 202, pages 27132–27187, Honolulu, Hawaii, USA, 2023. PMLR.
- [33] Krishna Pillutla et al. Federated learning with partial model personalization. In Proc. ICML, volume 162, pages 17716–17758, virtual, 2022. PMLR.
- [34] Zhen Qin et al. Fedapen: Personalized cross-silo federated learning with adaptability to statistical heterogeneity. In Proc. KDD, pages 1954–1964, Long Beach, CA, USA, 2023. ACM.
- [35] Felix Sattler et al. Fedaux: Leveraging unlabeled auxiliary data in federated learning. IEEE Trans. Neural Networks Learn. Syst., 1(1):1–13, 2021.
- [36] Felix Sattler et al. CFD: communication-efficient federated distillation via soft-label quantization and delta coding. IEEE Trans. Netw. Sci. Eng., 9(4):2025–2038, 2022.
- [37] Aviv Shamsian et al. Personalized federated learning using hypernetworks. In Proc. ICML, volume 139, pages 9489–9502, virtual, 2021. PMLR.
- [38] Tao Shen et al. Federated mutual learning. CoRR, abs/2006.16765, 2020.
- [39] Xiaorong Shi, Liping Yi, Xiaoguang Liu, and Gang Wang. FFEDCL: fair federated learning with contrastive learning. In Proc. ICASSP, pages 1–5, Rhodes Island, Greece, 2023. IEEE.
- [40] Alysa Ziying Tan et al. Towards personalized federated learning. IEEE Trans. Neural Networks Learn. Syst., 1(1):1–17, 2022.
- [41] Yue Tan et al. Fedproto: Federated prototype learning across heterogeneous clients. In Proc. AAAI, pages 8432–8440, virtual, 2022. AAAI Press.
- [42] Jiaqi Wang et al. Towards personalized federated learning via heterogeneous model reassembly. In Proc. NeurIPS, page 13, New Orleans, Louisiana, USA, 2023. OpenReview.net.
- [43] Chuhan Wu et al. Communication-efficient federated learning via knowledge distillation. Nature Communications, 13(1):2032, 2022.
- [44] Liping Yi, Xiaorong Shi, Nan Wang, Gang Wang, Xiaoguang Liu, Zhuan Shi, and Han Yu. pfedkt: Personalized federated learning with dual knowledge transfer. Knowledge-Based Systems, 292:111633, 2024.
- [45] Liping Yi, Xiaorong Shi, Nan Wang, Ziyue Xu, Gang Wang, and Xiaoguang Liu. pfedlhns: Personalized federated learning via local hypernetworks. In Proc. ICANN, volume 1, page 516–528. Springer, 2023.
- [46] Liping Yi, Xiaorong Shi, Nan Wang, Jinsong Zhang, Gang Wang, and Xiaoguang Liu. Fedpe: Adaptive model pruning-expanding for federated learning on mobile devices. IEEE Transactions on Mobile Computing, pages 1–18, 2024.
- [47] Liping Yi, Xiaorong Shi, Wenrui Wang, Gang Wang, and Xiaoguang Liu. Fedrra: Reputation-aware robust federated learning against poisoning attacks. In Proc. IJCNN, pages 1–8. IEEE, 2023.
- [48] Liping Yi, Gang Wang, and Xiaoguang Liu. QSFL: A two-level uplink communication optimization framework for federated learning. In Proc. ICML, volume 162, pages 25501–25513. PMLR, 2022.
- [49] Liping Yi, Gang Wang, Xiaoguang Liu, Zhuan Shi, and Han Yu. Fedgh: Heterogeneous federated learning with generalized global header. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM’23), page 11, Canada, 2023. ACM.
- [50] Liping Yi, Han Yu, Chao Ren, Heng Zhang, Gang Wang, Xiaoguang Liu, and Xiaoxiao Li. pfedafm: Adaptive feature mixture for batch-level personalization in heterogeneous federated learning. CoRR, abs/2404.17847, 2024.
- [51] Liping Yi, Han Yu, Chao Ren, Heng Zhang, Gang Wang, Xiaoguang Liu, and Xiaoxiao Li. pfedmoe: Data-level personalization with mixture of experts for model-heterogeneous personalized federated learning. CoRR, abs/2402.01350, 2024.
- [52] Liping Yi, Han Yu, Zhuan Shi, Gang Wang, Xiaoguang Liu, Lizhen Cui, and Xiaoxiao Li. FedSSA: Semantic Similarity-based Aggregation for Efficient Model-Heterogeneous Personalized Federated Learning. In IJCAI, 2024.
- [53] Liping Yi, Han Yu, Gang Wang, and Xiaoguang Liu. Fedlora: Model-heterogeneous personalized federated learning with lora tuning. CoRR, abs/2310.13283, 2023.
- [54] Liping Yi, Han Yu, Gang Wang, and Xiaoguang Liu. pfedes: Model heterogeneous personalized federated learning with feature extractor sharing. CoRR, abs/2311.06879, 2023.
- [55] Liping Yi, Jinsong Zhang, Rui Zhang, Jiaqi Shi, Gang Wang, and Xiaoguang Liu. Su-net: An efficient encoder-decoder model of federated learning for brain tumor segmentation. In Proc. ICANN, volume 12396, pages 761–773. Springer, 2020.
- [56] Fuxun Yu et al. Fed2: Feature-aligned federated learning. In Proc. KDD, pages 2066–2074, virtual, 2021. ACM.
- [57] Sixing Yu et al. Resource-aware federated learning using knowledge extraction and multi-model fusion. CoRR, abs/2208.07978, 2022.
- [58] Jianqing Zhang, Yang Liu, Yang Hua, and Jian Cao. Fedtgp: Trainable global prototypes with adaptive-margin-enhanced contrastive learning for data and model heterogeneity in federated learning. In Proc. AAAI, pages 16768–16776, Vancouver, Canada, 2024. AAAI Press.
- [59] Jie Zhang et al. Parameterized knowledge transfer for personalized federated learning. In Proc. NeurIPS, pages 10092–10104, virtual, 2021. OpenReview.net.
- [60] Jie Zhang et al. Towards data-independent knowledge transfer in model-heterogeneous federated learning. IEEE Trans. Computers, 72(10):2888–2901, 2023.
- [61] Lan Zhang et al. Fedzkt: Zero-shot knowledge transfer towards resource-constrained federated learning with heterogeneous on-device models. In Proc. ICDCS, pages 928–938, virtual, 2022. IEEE.
- [62] Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proc. NeurIPS, pages 8792–8802, Montréal, Canada, 2018. Curran Associates Inc.
- [63] Zhuangdi Zhu et al. Data-free knowledge distillation for heterogeneous federated learning. In Proc. ICML, volume 139, pages 12878–12889, virtual, 2021. PMLR.
- [64] Zhuangdi Zhu et al. Resilient and communication efficient learning for heterogeneous federated systems. In Proc. ICML, volume 162, pages 27504–27526, virtual, 2022. PMLR.
Appendix A Pseudo codes of FedMRL
Input: $N$ , total number of clients; $K$ , number of selected clients in one round; $T$ , total number of rounds; $\eta_{\omega}$ , learning rate of client local heterogeneous models; $\eta_{\theta}$ , learning rate of homogeneous small model; $\eta_{\varphi}$ , learning rate of the representation projector.
Output: client whole models excluding the global header, $[\mathcal{G}(\theta^{ex,T-1})\circ\mathcal{F}_{0}(\omega_{0}^{T-1})|\mathcal{P}_{0}(\varphi_{0}^{T-1}),\ldots,\mathcal{G}(\theta^{ex,T-1})\circ\mathcal{F}_{N-1}(\omega_{N-1}^{T-1})|\mathcal{P}_{N-1}(\varphi_{N-1}^{T-1})]$ .
Randomly initialize the global homogeneous small model $\mathcal{G}(\theta^{0})$ , client local heterogeneous models $[\mathcal{F}_{0}(\omega_{0}^{0}),...,\mathcal{F}_{N-1}(\omega_{N-1}^{0})]$ , and local heterogeneous representation projectors $[\mathcal{P}_{0}(\varphi_{0}^{0}),...,\mathcal{P}_{N-1}(\varphi_{N-1}^{0})]$ .
for each round t=1,…,T-1 do
// Server Side:
$S^{t}$ $←$ Randomly sample $K$ clients from $N$ clients;
Broadcast the global homogeneous small model $\theta^{t-1}$ to sampled $K$ clients;
$\theta_{k}^{t}←$ ClientUpdate ( $\theta^{t-1}$ );
/* Aggregate Local Homogeneous Small Models */
$\theta^{t}=\sum_{k=0}^{K-1}{\frac{n_{k}}{n}\theta_{k}^{t}}$ .
// ClientUpdate:
Receive the global homogeneous small model $\theta^{t-1}$ from the server;
for $k∈ S^{t}$ do
/* Local Training with MRL */
for $(\boldsymbol{x}_{i},y_{i})∈ D_{k}$ do
$\boldsymbol{\mathcal{R}}_{i}^{\mathcal{G}}=\mathcal{G}^{ex}(\boldsymbol{x}_{i};\theta^{ex,t-1}),\ \boldsymbol{\mathcal{R}}_{i}^{\mathcal{F}_{k}}=\mathcal{F}_{k}^{ex}(\boldsymbol{x}_{i};\omega_{k}^{ex,t-1})$ ;
$\boldsymbol{\mathcal{R}}_{i}=\boldsymbol{\mathcal{R}}_{i}^{\mathcal{G}}\circ\boldsymbol{\mathcal{R}}_{i}^{\mathcal{F}_{k}}$ ;
$\widetilde{\boldsymbol{\mathcal{R}}}_{i}=\mathcal{P}_{k}(\boldsymbol{\mathcal{R}}_{i};\varphi_{k}^{t-1})$ ;
$\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{lc}=\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{1:d_{1}},\ \widetilde{\boldsymbol{\mathcal{R}}}_{i}^{hf}=\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{1:d_{2}}$ ;
$\hat{y}_{i}^{\mathcal{G}}=\mathcal{G}^{hd}(\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{lc};\theta^{hd,t-1}),\ \hat{y}_{i}^{\mathcal{F}_{k}}=\mathcal{F}_{k}^{hd}(\widetilde{\boldsymbol{\mathcal{R}}}_{i}^{hf};\omega_{k}^{hd,t-1})$ ;
$\ell_{i}^{\mathcal{G}}=\ell(\hat{y}_{i}^{\mathcal{G}},y_{i}),\ \ell_{i}^{\mathcal{F}_{k}}=\ell(\hat{y}_{i}^{\mathcal{F}_{k}},y_{i})$ ;
$\ell_{i}=m_{i}^{\mathcal{G}}\cdot\ell_{i}^{\mathcal{G}}+m_{i}^{\mathcal{F}_{k}}\cdot\ell_{i}^{\mathcal{F}_{k}}$ ;
$\theta_{k}^{t}←\theta^{t-1}-\eta_{\theta}∇\ell_{i}$ ;
$\omega_{k}^{t}←\omega_{k}^{t-1}-\eta_{\omega}∇\ell_{i}$ ;
$\varphi_{k}^{t}←\varphi_{k}^{t-1}-\eta_{\varphi}∇\ell_{i}$ ;
end for
Upload updated local homogeneous small model $\theta_{k}^{t}$ to the server.
end for
end for
Algorithm 1 FedMRL
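One client update plus the server-side aggregation in Algorithm 1 can be sketched with toy linear stand-ins. This is an illustrative sketch only: linear maps replace the CNN extractors and headers, the loss weights $m_i^{\mathcal{G}}$ and $m_i^{\mathcal{F}_k}$ are fixed at $0.5$ here, and all dimensions are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d1, d2, n_cls = 32, 8, 16, 10   # toy sizes; d1 < d2 = fused dimension

# toy linear modules (weight matrices only; the paper's models are CNNs)
G_ex = rng.normal(size=(d_in, d1))    # global small model's feature extractor
F_ex = rng.normal(size=(d_in, d2))    # client k's heterogeneous feature extractor
P    = rng.normal(size=(d1 + d2, d2)) # personalized representation projector
G_hd = rng.normal(size=(d1, n_cls))   # global header (low-capacity d1 prefix)
F_hd = rng.normal(size=(d2, n_cls))   # local header (high-fidelity d2 prefix)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def ce(logits, y):  # cross-entropy loss
    return -np.log(softmax(logits)[np.arange(len(y)), y] + 1e-12).mean()

x = rng.normal(size=(4, d_in)); y = np.array([0, 1, 2, 3])  # toy batch

R = np.concatenate([x @ G_ex, x @ F_ex], axis=1)  # fuse both representations
R_t = R @ P                                       # project to fused rep (dim d2)
R_lc, R_hf = R_t[:, :d1], R_t[:, :d2]             # nested Matryoshka prefixes
loss_G = ce(R_lc @ G_hd, y)                       # global-header loss on d1 prefix
loss_F = ce(R_hf @ F_hd, y)                       # local-header loss on d2 prefix
loss = 0.5 * loss_G + 0.5 * loss_F                # weighted total (m_i fixed here)

# server side: sample-size-weighted FedAvg of the updated small models
thetas = [G_ex + rng.normal(scale=0.01, size=G_ex.shape) for _ in range(3)]
n_k = np.array([100.0, 200.0, 300.0])
theta_new = sum((n / n_k.sum()) * th for n, th in zip(n_k, thetas))
```

Only the small model's parameters (`theta_new`) travel between clients and the server; the heterogeneous extractor, projector, and local header stay on-device.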
Appendix B Theoretical Proofs
We first define the following additional notation. $t\in\{0,\ldots,T-1\}$ denotes the $t$-th round. $e\in\{0,1,\ldots,E\}$ denotes the $e$-th iteration of local training. $tE+0$ indicates that clients have received the global homogeneous small model $\mathcal{G}(\theta^{t})$ from the server before the $(t+1)$-th round’s local training. $tE+e$ denotes the $e$-th iteration of the $(t+1)$-th round’s local training, and $tE+E$ marks its end, after which clients upload their updated local homogeneous small models to the server for aggregation. $\mathcal{W}_{k}(w_{k})$ denotes the whole model trained on client $k$, comprising the global homogeneous small model $\mathcal{G}(\theta)$, client $k$’s local heterogeneous model $\mathcal{F}_{k}(\omega_{k})$, and the personalized representation projector $\mathcal{P}_{k}(\varphi_{k})$. $\eta$ denotes the learning rates of the whole model on client $k$, i.e., $\{\eta_{\theta},\eta_{\omega},\eta_{\varphi}\}$.
**Assumption 1**
*Lipschitz Smoothness. The gradients of client $k$’s whole local model $w_{k}$ are $L_{1}$-Lipschitz smooth [41],
$$
\|\nabla\mathcal{L}_{k}^{t_{1}}(w_{k}^{t_{1}};\boldsymbol{x},y)-\nabla\mathcal{L}_{k}^{t_{2}}(w_{k}^{t_{2}};\boldsymbol{x},y)\|\leq L_{1}\|w_{k}^{t_{1}}-w_{k}^{t_{2}}\|,\quad\forall t_{1},t_{2}>0,\ k\in\{0,1,\ldots,N-1\},\ (\boldsymbol{x},y)\in D_{k}. \tag{15}
$$
The above formulation can be re-expressed as:
$$
\mathcal{L}_{k}^{t_{1}}-\mathcal{L}_{k}^{t_{2}}\leq\langle\nabla\mathcal{L}_{k}^{t_{2}},(w_{k}^{t_{1}}-w_{k}^{t_{2}})\rangle+\frac{L_{1}}{2}\|w_{k}^{t_{1}}-w_{k}^{t_{2}}\|_{2}^{2}. \tag{16}
$$*
**Assumption 2**
*Unbiased Gradient and Bounded Variance. Client $k$ ’s random gradient $g_{w,k}^{t}=∇\mathcal{L}_{k}^{t}(w_{k}^{t};\mathcal{B}_{k}^{t})$ ( $\mathcal{B}$ is a batch of local data) is unbiased,
$$
\mathbb{E}_{\mathcal{B}_{k}^{t}\subseteq D_{k}}[g_{w,k}^{t}]=\nabla\mathcal{L}_{k}^{t}(w_{k}^{t}), \tag{17}
$$
and the variance of random gradient $g_{w,k}^{t}$ is bounded by:
$$
\mathbb{E}_{\mathcal{B}_{k}^{t}\subseteq D_{k}}\big[\|\nabla\mathcal{L}_{k}^{t}(w_{k}^{t};\mathcal{B}_{k}^{t})-\nabla\mathcal{L}_{k}^{t}(w_{k}^{t})\|_{2}^{2}\big]\leq\sigma^{2}. \tag{18}
$$*
**Assumption 3**
*Bounded Parameter Variation. The variation between the homogeneous small models $\theta_{k}^{t}$ before aggregation and $\theta^{t}$ after aggregation at the FL server is bounded by:
$$
{\|\theta^{t}-\theta_{k}^{t}\|}_{2}^{2}\leq\delta^{2}. \tag{19}
$$*
B.1 Proof of Lemma 1
**Proof 1**
*An arbitrary client $k$’s whole local model $w$ is updated by $w_{t+1}=w_{t}-\eta g_{w,t}$ in the $(t+1)$-th round. Following Assumption 1, we obtain
$$
\begin{aligned}
\mathcal{L}_{tE+1}&\leq\mathcal{L}_{tE+0}+\langle\nabla\mathcal{L}_{tE+0},(w_{tE+1}-w_{tE+0})\rangle+\frac{L_{1}}{2}\|w_{tE+1}-w_{tE+0}\|_{2}^{2}\\
&=\mathcal{L}_{tE+0}-\eta\langle\nabla\mathcal{L}_{tE+0},g_{w,tE+0}\rangle+\frac{L_{1}\eta^{2}}{2}\|g_{w,tE+0}\|_{2}^{2}. \end{aligned}\tag{20}
$$
Taking the expectation of both sides with respect to the random variable $\xi_{tE+0}$,
$$
\begin{aligned}
\mathbb{E}[\mathcal{L}_{tE+1}]&\leq\mathcal{L}_{tE+0}-\eta\,\mathbb{E}[\langle\nabla\mathcal{L}_{tE+0},g_{w,tE+0}\rangle]+\frac{L_{1}\eta^{2}}{2}\mathbb{E}[\|g_{w,tE+0}\|_{2}^{2}]\\
&\stackrel{(a)}{=}\mathcal{L}_{tE+0}-\eta\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\frac{L_{1}\eta^{2}}{2}\mathbb{E}[\|g_{w,tE+0}\|_{2}^{2}]\\
&\stackrel{(b)}{\leq}\mathcal{L}_{tE+0}-\eta\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\frac{L_{1}\eta^{2}}{2}\big((\mathbb{E}[\|g_{w,tE+0}\|_{2}])^{2}+\operatorname{Var}(g_{w,tE+0})\big)\\
&\stackrel{(c)}{=}\mathcal{L}_{tE+0}-\eta\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\frac{L_{1}\eta^{2}}{2}\big(\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\operatorname{Var}(g_{w,tE+0})\big)\\
&\stackrel{(d)}{\leq}\mathcal{L}_{tE+0}-\eta\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\frac{L_{1}\eta^{2}}{2}\big(\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\sigma^{2}\big)\\
&=\mathcal{L}_{tE+0}+\Big(\frac{L_{1}\eta^{2}}{2}-\eta\Big)\|\nabla\mathcal{L}_{tE+0}\|_{2}^{2}+\frac{L_{1}\eta^{2}\sigma^{2}}{2}. \end{aligned}\tag{21}
$$
(a), (c), (d) follow Assumption 2, and (b) follows from $\operatorname{Var}(x)=\mathbb{E}[x^{2}]-(\mathbb{E}[x])^{2}$. Telescoping this inequality over the $E$ local iterations of the $(t+1)$-th round, we obtain
$$
\mathbb{E}[\mathcal{L}_{(t+1)E}]\leq\mathcal{L}_{tE+0}+\Big(\frac{L_{1}\eta^{2}}{2}-\eta\Big)\sum_{e=0}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}. \tag{22}
$$*
B.2 Proof of Lemma 2
**Proof 2**
*$$
\begin{aligned}
\mathcal{L}_{(t+1)E+0}&=\mathcal{L}_{(t+1)E}+\mathcal{L}_{(t+1)E+0}-\mathcal{L}_{(t+1)E}\\
&\stackrel{(a)}{\approx}\mathcal{L}_{(t+1)E}+\eta\|\theta_{(t+1)E+0}-\theta_{(t+1)E}\|_{2}^{2}\\
&\stackrel{(b)}{\leq}\mathcal{L}_{(t+1)E}+\eta\delta^{2}. \end{aligned}\tag{23}
$$
(a): we approximate the loss variation by the squared parameter variation scaled by the learning rate, i.e., $\Delta\mathcal{L}\approx\eta\cdot\|\Delta\theta\|_{2}^{2}$. (b) follows Assumption 3. Taking the expectation of both sides with respect to the random variable $\xi$, we obtain
$$
\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\leq\mathbb{E}[\mathcal{L}_{(t+1)E}]+\eta\delta^{2}. \tag{24}
$$*
B.3 Proof of Theorem 1
**Proof 3**
*Substituting Lemma 1 into the right-hand side of Lemma 2’s inequality, we obtain
$$
\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\leq\mathcal{L}_{tE+0}+\Big(\frac{L_{1}\eta^{2}}{2}-\eta\Big)\sum_{e=0}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}. \tag{25}
$$*
B.4 Proof of Theorem 2
**Proof 4**
*Rearranging Eq. (25), we obtain
$$
\sum_{e=0}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}\leq\frac{\mathcal{L}_{tE+0}-\mathbb{E}[\mathcal{L}_{(t+1)E+0}]+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}}{\eta-\frac{L_{1}\eta^{2}}{2}}. \tag{26}
$$
Averaging over rounds $t=0,\ldots,T-1$, we obtain
$$
\frac{1}{T}\sum_{t=0}^{T-1}\sum_{e=0}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}\leq\frac{\frac{1}{T}\sum_{t=0}^{T-1}\big[\mathcal{L}_{tE+0}-\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\big]+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}}{\eta-\frac{L_{1}\eta^{2}}{2}}. \tag{27}
$$
Let $\Delta=\mathcal{L}_{t=0}-\mathcal{L}^{*}>0$. Since the sum telescopes, $\sum_{t=0}^{T-1}\big[\mathcal{L}_{tE+0}-\mathbb{E}[\mathcal{L}_{(t+1)E+0}]\big]\leq\Delta$, so
$$
\frac{1}{T}\sum_{t=0}^{T-1}\sum_{e=0}^{E-1}\|\nabla\mathcal{L}_{tE+e}\|_{2}^{2}\leq\frac{\frac{\Delta}{T}+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}}{\eta-\frac{L_{1}\eta^{2}}{2}}. \tag{28}
$$
If the right-hand side is bounded by a given constant $\epsilon>0$, i.e.,
$$
\frac{\frac{\Delta}{T}+\frac{L_{1}E\eta^{2}\sigma^{2}}{2}+\eta\delta^{2}}{\eta-\frac{L_{1}\eta^{2}}{2}}<\epsilon, \tag{29}
$$
then
$$
T>\frac{\Delta}{\epsilon(\eta-\frac{L_{1}\eta^{2}}{2})-\frac{L_{1}E\eta^{2}\sigma^{2}}{2}-\eta\delta^{2}}. \tag{30}
$$
Since $T>0$ and $\Delta>0$, we require
$$
\epsilon\Big(\eta-\frac{L_{1}\eta^{2}}{2}\Big)-\frac{L_{1}E\eta^{2}\sigma^{2}}{2}-\eta\delta^{2}>0. \tag{31}
$$
Solving this inequality yields
$$
\eta<\frac{2(\epsilon-\delta^{2})}{L_{1}(\epsilon+E\sigma^{2})}. \tag{32}
$$
Since $\epsilon,\ L_{1},\ \sigma^{2},\ \delta^{2}$ are all positive constants (with $\epsilon>\delta^{2}$), $\eta$ has feasible solutions. Therefore, when the learning rates $\eta=\{\eta_{\theta},\eta_{\omega},\eta_{\varphi}\}$ satisfy the above condition, any client’s whole local model converges. Since all terms on the right-hand side of Eq. (28) except $\Delta/T$ are constants, FedMRL achieves a non-convex convergence rate of $\mathcal{O}(1/T)$.*
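The feasibility of the learning-rate condition in Eq. (32) can be checked numerically. The constants below ($L_1$, $E$, $\sigma^2$, $\delta^2$, $\epsilon$) are illustrative values of our own choosing, not estimates from the paper:

```python
# Illustrative constants: smoothness L1, local iterations E, gradient-variance
# bound sigma^2, aggregation-variation bound delta^2, target constant epsilon.
L1, E, sigma2, delta2, eps = 2.0, 5, 0.1, 0.01, 1.0

# Eq. (32): upper bound on the learning rate eta.
eta_max = 2 * (eps - delta2) / (L1 * (eps + E * sigma2))
eta = 0.9 * eta_max  # any eta strictly below the bound should satisfy Eq. (31)

# Eq. (31): the denominator of Eq. (30) must be positive for a finite T to exist.
lhs = eps * (eta - L1 * eta**2 / 2) - L1 * E * eta**2 * sigma2 / 2 - eta * delta2
assert lhs > 0
```

With these values, `eta_max = 0.66`, and any $\eta$ below it keeps the denominator of Eq. (30) positive, so a finite number of rounds $T$ suffices.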
Appendix C More Experimental Details
Here, we provide details of the model structures used in our experiments, additional results for model-homogeneous FL scenarios, and experimental evidence supporting the choice of inference model.
C.1 Model Structures
Table 2 shows the structures of models used in experiments.
Table 2: Structures of $5$ heterogeneous CNN models.
| Layer Name | CNN-1 | CNN-2 | CNN-3 | CNN-4 | CNN-5 |
| --- | --- | --- | --- | --- | --- |
| Conv1 | 5 $×$ 5, 16 | 5 $×$ 5, 16 | 5 $×$ 5, 16 | 5 $×$ 5, 16 | 5 $×$ 5, 16 |
| Maxpool1 | 2 $×$ 2 | 2 $×$ 2 | 2 $×$ 2 | 2 $×$ 2 | 2 $×$ 2 |
| Conv2 | 5 $×$ 5, 32 | 5 $×$ 5, 16 | 5 $×$ 5, 32 | 5 $×$ 5, 32 | 5 $×$ 5, 32 |
| Maxpool2 | 2 $×$ 2 | 2 $×$ 2 | 2 $×$ 2 | 2 $×$ 2 | 2 $×$ 2 |
| FC1 | 2000 | 2000 | 1000 | 800 | 500 |
| FC2 | 500 | 500 | 500 | 500 | 500 |
| FC3 | 10/100 | 10/100 | 10/100 | 10/100 | 10/100 |
| model size | 10.00 MB | 6.92 MB | 5.04 MB | 3.81 MB | 2.55 MB |
Note: $5×5$ denotes the kernel size; $16$ and $32$ denote the number of filters in the convolutional layers.
C.2 Homogeneous FL Results
Table 3 presents the results of FedMRL and baselines in model-homogeneous FL scenarios.
Table 3: Average test accuracy (%) in model-homogeneous FL.
| Method | CIFAR-10 (N=10, C=100%) | CIFAR-100 (N=10, C=100%) | CIFAR-10 (N=50, C=20%) | CIFAR-100 (N=50, C=20%) | CIFAR-10 (N=100, C=10%) | CIFAR-100 (N=100, C=10%) |
| --- | --- | --- | --- | --- | --- | --- |
| Standalone | 96.35 | 74.32 | 95.25 | 62.38 | 92.58 | 54.93 |
| LG-FedAvg [24] | 96.47 | 73.43 | 94.20 | 61.77 | 90.25 | 46.64 |
| FD [19] | 96.30 | - | - | - | - | - |
| FedProto [41] | 95.83 | 72.79 | 95.10 | 62.55 | 91.19 | 54.01 |
| FML [38] | 94.83 | 70.02 | 93.18 | 57.56 | 87.93 | 46.20 |
| FedKD [43] | 94.77 | 70.04 | 92.93 | 57.56 | 90.23 | 50.99 |
| FedAPEN [34] | 95.38 | 71.48 | 93.31 | 57.62 | 87.97 | 46.85 |
| FedMRL | 96.71 | 74.52 | 95.76 | 66.46 | 95.52 | 60.64 |
| FedMRL -Best B. | 0.24 | 0.20 | 0.51 | 3.91 | 2.94 | 5.71 |
| FedMRL -Best S.C.B. | 1.33 | 3.04 | 2.45 | 8.84 | 5.29 | 9.65 |
“-”: failing to converge. “FedMRL -Best B.”: FedMRL’s accuracy improvement over the best baseline. “FedMRL -Best S.C.B.”: FedMRL’s accuracy improvement over the best same-category (mutual learning-based MHeteroFL) baseline. The underlined values denote the largest accuracy improvement of FedMRL across the $6$ settings.
C.3 Inference Model Comparison
There are $4$ alternative models for inference in FedMRL: (1) mix-small (the combination of the homogeneous small model, the client heterogeneous model’s feature extractor, and the representation projector, i.e., removing the local header), (2) mix-large (the combination of the homogeneous small model’s feature extractor, the client heterogeneous model, and the representation projector, i.e., removing the global header), (3) single-small (the homogeneous small model alone), and (4) single-large (the client heterogeneous model alone). We compare their performance under the $(N=100,C=10\%)$ setting. Figure 7 shows that mix-small achieves accuracy similar to mix-large, which is used as the default inference model, and both significantly outperform the single homogeneous small model and the single heterogeneous client model. Therefore, users can choose mix-small or mix-large for inference based on their inference budgets in practical applications.
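The four inference variants can be sketched as compositions of the trained modules. This is a toy illustration with linear stand-ins for the extractors and headers; all names and dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d1, d2, n_cls = 32, 8, 16, 10  # toy dimensions

G_ex, G_hd = rng.normal(size=(d_in, d1)), rng.normal(size=(d1, n_cls))  # small model
F_ex, F_hd = rng.normal(size=(d_in, d2)), rng.normal(size=(d2, n_cls))  # large model
P = rng.normal(size=(d1 + d2, d2))  # representation projector

def fused(x):
    """Fuse both extractors' representations and project (shared by mix-*)."""
    return np.concatenate([x @ G_ex, x @ F_ex], axis=1) @ P

def predict(x, mode):
    if mode == "single-small":  # homogeneous small model alone
        return (x @ G_ex) @ G_hd
    if mode == "single-large":  # client heterogeneous model alone
        return (x @ F_ex) @ F_hd
    if mode == "mix-small":     # fused rep, d1 prefix, global header (local header removed)
        return fused(x)[:, :d1] @ G_hd
    if mode == "mix-large":     # fused rep, d2 prefix, local header (global header removed)
        return fused(x)[:, :d2] @ F_hd
    raise ValueError(mode)

x = rng.normal(size=(2, d_in))
preds = {m: predict(x, m).argmax(-1) for m in
         ["single-small", "single-large", "mix-small", "mix-large"]}
```

The mix variants differ only in which prefix of the fused representation they classify and which header they keep, which is why their accuracies track each other and the choice can be driven by inference cost.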
<details>
<summary>x23.png Details</summary>

[Figure panel: CIFAR-10 test accuracy vs. $d_1$ for the four inference variants (Mix-S, Mix-L, Single-S, Single-L). Mix-L is highest and most stable (≈86-90%); Mix-S follows (≈76-82%); Single-L rises from ≈44% to ≈58% as $d_1$ grows; Single-S stays lowest (≈30-38%).]
</details>
<details>
<summary>x24.png Details</summary>

[Figure panel: CIFAR-100 test accuracy vs. $d_1$ for the four inference variants. Mix-S and Mix-L both hold ≈58% across all $d_1$; Single-S stays ≈10-12%; Single-L stays ≈8-10%.]
The data suggests that the "Mix" configurations (Mix-S and Mix-L) are substantially more effective than the "Single" configurations (Single-S and Single-L) in achieving high test accuracy on the CIFAR-100 dataset. The parameter 'd1' appears to have a minimal impact on test accuracy within the tested range, indicating that the model performance is not highly sensitive to changes in this parameter. The consistent performance of Mix-S and Mix-L suggests that the mixing strategy employed in these configurations is beneficial for learning robust features from the CIFAR-100 data. The low accuracy of Single-S and Single-L may indicate that these configurations struggle to generalize well to the dataset, potentially due to limitations in their model capacity or training process. The lack of a clear trend with respect to 'd1' suggests that other factors, such as model architecture or training hyperparameters, may play a more significant role in determining the overall performance.
</details>
Figure 7: Accuracy of four optional inference models: mix-small (the whole model without the local header), mix-large (the whole model without the global header), single-small (the homogeneous small model), single-large (the client heterogeneous model).
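The four inference variants compared in Figure 7 can be viewed as different compositions of the shared and local components. Below is a minimal NumPy sketch of this composition; all dimensions, the random linear maps, and the variable names are hypothetical stand-ins for trained parameters, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
lin = lambda di, do: rng.standard_normal((do, di)) * 0.01  # stand-in for a trained layer

# Hypothetical dimensions: input, representation, number of classes.
d_in, d, n_cls = 32, 48, 10

W_g = lin(d_in, d)      # global homogeneous small feature extractor
W_l = lin(d_in, d)      # local heterogeneous feature extractor
W_p = lin(2 * d, d)     # personalized lightweight representation projector
H_g = lin(d, n_cls)     # global (small) model header
H_l = lin(d, n_cls)     # local (heterogeneous) model header

def fuse(x):
    # Fused representation: projector applied to the concatenated
    # generalized (global) and personalized (local) representations.
    return W_p @ np.concatenate([W_g @ x, W_l @ x])

x = rng.standard_normal(d_in)
z = fuse(x)

# The four optional inference models of Figure 7:
mix_small    = H_g @ z            # whole model minus the local header
mix_large    = H_l @ z            # whole model minus the global header
single_small = H_g @ (W_g @ x)    # homogeneous small model alone
single_large = H_l @ (W_l @ x)    # client heterogeneous model alone

for name, logits in [("mix-small", mix_small), ("mix-large", mix_large),
                     ("single-small", single_small), ("single-large", single_large)]:
    print(name, logits.shape)
```

The sketch makes the ablation's point concrete: the Mix variants reuse the fused representation and differ only in the header, while the Single variants bypass the projector entirely.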
Appendix D Discussion
We discuss how FedMRL tackles heterogeneity, and analyze its privacy protection, communication cost, and computational overhead.
Tackling Heterogeneity. FedMRL allows each client to tailor its heterogeneous local model to its system resources, which addresses system and model heterogeneity. Through a personalized lightweight representation projector, each client performs multi-granularity representation learning adapted to its local non-IID data distribution, alleviating data heterogeneity.
Privacy. The server and clients only exchange the homogeneous small models, while each heterogeneous local model always remains on its client. Moreover, representation splicing decouples the structure of the homogeneous global model from that of the heterogeneous local model. Therefore, both the parameters and the structure of the heterogeneous client model are strongly protected. Meanwhile, local data never leave the clients during training, so local data privacy is also preserved.
Communication Cost. The server and clients transmit homogeneous small models with far fewer parameters than the clients' heterogeneous local models, incurring significantly lower per-round communication costs than methods such as FedAvg that transmit complete local models.
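The per-round saving can be made concrete with a back-of-the-envelope calculation. The parameter counts below are hypothetical, chosen only to illustrate the scale of the difference between exchanging the small homogeneous model and exchanging a full client model:

```python
# Hypothetical parameter counts (illustrative, not measured in the paper):
full_local_model  = 11_180_000   # e.g., a ResNet-18-sized heterogeneous client model
small_homog_model = 120_000      # shared homogeneous small model (the only model exchanged)
bytes_per_param   = 4            # float32

# Per round, each client uploads and downloads only the small homogeneous model.
fedmrl_mb = 2 * small_homog_model * bytes_per_param / 1e6
fedavg_mb = 2 * full_local_model * bytes_per_param / 1e6

print(f"FedMRL per-round traffic per client:  {fedmrl_mb:.2f} MB")
print(f"full-model exchange (FedAvg-style):   {fedavg_mb:.2f} MB")
print(f"reduction factor:                     {fedavg_mb / fedmrl_mb:.0f}x")
```

Under these assumed sizes, the per-round traffic drops by roughly two orders of magnitude; the exact factor depends on the actual model sizes used.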
Computational Overhead. In addition to training its heterogeneous local model, each client also trains the homogeneous global small model and a lightweight representation projector, both of which have far fewer parameters than the heterogeneous local model. The computational overhead per training round is therefore only slightly increased. Since personalized Matryoshka representation learning adapts to the local data distribution from multiple perspectives, it improves model learning capability and accelerates convergence, reducing the number of training rounds needed. Consequently, the total computational cost may even be reduced.
Appendix E Broader Impacts and Limitations
Broader Impacts. FedMRL improves model performance as well as communication and computational efficiency for model heterogeneous federated learning, while effectively protecting the privacy of clients' heterogeneous local models and non-IID data. It can be applied in a wide range of practical FL applications.
Limitations. The multi-granularity embedded representations within the Matryoshka representations are processed by the global small model's header and the local client model's header, respectively. Even though the global header involves only one linear layer, it adds storage, communication, and training overhead. In future work, we will follow the more efficient Matryoshka representation learning variant (MRL-E) [21], which removes the global header and uses only the local model header to process all multi-granularity Matryoshka representations, enabling a better trade-off between model performance and the costs of storage, communication, and computation.
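The weight-tying idea behind MRL-E can be illustrated with a small sketch: a single shared header matrix is sliced so that its first m columns score the m-dimensional prefix of the representation, so no per-granularity header needs to be stored. The dimensions, granularities, and random parameters below are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_cls = 48, 10
granularities = [12, 24, 48]   # hypothetical nested representation sizes

z = rng.standard_normal(d)                   # fused (Matryoshka) representation
W = rng.standard_normal((n_cls, d)) * 0.01   # single shared header (MRL-E-style tying)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

y = 3  # true class label
loss = 0.0
for m in granularities:
    # Reuse the first m columns of the shared header for the m-dimensional
    # prefix of z -- one parameter matrix serves every granularity.
    logits = W[:, :m] @ z[:m]
    loss += -np.log(softmax(logits)[y])       # cross-entropy at this granularity

print(f"summed multi-granularity loss: {loss:.3f}")
```

Summing the per-granularity losses trains every nested prefix of z to be predictive on its own, while the tied header keeps the parameter and communication cost independent of the number of granularities.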