NeurIPS

Federated Model Heterogeneous Matryoshka Representation Learning


Background & Academic Lineage

The Origin & Academic Lineage

The problem addressed in this paper, model heterogeneous federated learning (MHeteroFL), emerged from the practical challenges encountered in traditional federated learning (FL). Traditional FL, as introduced by works like [32, 47, 46, 12], typically involves a central server coordinating multiple clients to train a single, global shared model without exposing their local data. This setup, while preserving data privacy by only transmitting model parameters [14, 56, 51], struggles with several forms of heterogeneity common in real-world applications.

Specifically, three fundamental "pain points" forced the development of MHeteroFL and subsequently, this paper:

  1. Data Heterogeneity (Non-IID Data): Clients' local data often do not follow an independent and identically distributed (IID) pattern, i.e., they are non-IID [42]. This means a single global model, trained by aggregating local models, might perform poorly on individual clients due to the diverse nature of their data [49, 48].
  2. System Heterogeneity: FL clients can have vastly different computing power and network bandwidth [11]. Forcing all clients to train the same model structure means the global model size must be constrained by the weakest device, leading to sub-optimal performance on more powerful clients [52, 54, 50].
  3. Model Heterogeneity: Enterprises acting as FL clients often possess proprietary models with heterogeneous structures that cannot be directly shared due to intellectual property (IP) concerns [43].

The field of MHeteroFL [55] arose to enable FL clients to train local models with tailored structures suited to their specific system resources and data distributions. However, existing MHeteroFL methods [41, 45] face their own limitations. They primarily rely on training loss to transfer knowledge between client and server models, which results in limited knowledge exchange, model performance bottlenecks, high communication and computation costs, and the risk of exposing private local model structures and data. For instance, methods using adaptive subnets struggle to aggregate black-box local models; knowledge distillation often requires hard-to-find public datasets or incurs high training costs; model splitting can expose proprietary IP; and mutual learning, while promising, only transfers limited knowledge, leading to performance bottlenecks. This paper aims to overcome these limitations by proposing a novel approach that enhances knowledge transfer and improves model learning capability in a more efficient and private manner.

Intuitive Domain Terms

Here are a few specialized domain terms from the paper, translated into intuitive, everyday analogies for a zero-base reader:

  • Federated Learning (FL): Imagine a group of students from different schools trying to learn a new subject together, but they can't share their personal notes (local data) directly due to privacy rules. Instead, each student studies using their own notes and then sends a summary of what they've learned (model updates) to a central teacher. The teacher combines all the summaries to create a better, more comprehensive lesson plan (global model), which is then shared back to the students. This way, everyone learns from collective experience without anyone's private notes ever leaving their school.
  • Model Heterogeneity: Think of a team of specialized doctors, each with their own unique diagnostic tools and expertise (models) for different types of patients. Model heterogeneity means that these tools and expertise are not identical across all doctors. Some might have advanced MRI machines, others might specialize in X-rays, and they all have different ways of interpreting results. The challenge is how they can collaboratively improve their overall diagnostic capabilities without sharing their proprietary tools or methods directly.
  • Matryoshka Representation Learning (MRL): This is like a set of Russian nesting dolls. Each doll represents a different level of detail or "understanding" about a piece of information. The largest doll gives a broad, general overview, while the smaller dolls nested inside provide progressively finer and more specific details. MRL allows a machine learning model to extract these multi-layered "understandings" from data, so it can choose the appropriate level of detail needed for a task, balancing accuracy with efficiency.
  • Non-IID Data (Non-Independent and Identically Distributed Data): Consider a global food delivery service trying to predict popular dishes. If all their customers lived in a single city, their data on food preferences would likely be "IID" (everyone might order pizza or burgers). However, if customers are spread across diverse countries, their preferences will be "non-IID" – some might prefer sushi, others tacos, and others curry. This means the data is not uniformly distributed, and a model trained on it needs to be flexible enough to handle these diverse local tastes rather than assuming a single global preference.
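The nesting-doll idea behind MRL can be made concrete in a few lines. The sketch below (hypothetical helper `matryoshka_views`, not from the paper's code) shows how one embedding yields several nested views, each a prefix of the next finer one:

```python
# Minimal sketch of the Matryoshka nesting idea: one embedding, several
# nested granularities. Names and dimensions are illustrative only.
def matryoshka_views(embedding, dims):
    """Return nested prefix views of one embedding, one per granularity."""
    return {d: embedding[:d] for d in dims}

full = [0.9, -0.1, 0.4, 0.7, -0.3, 0.2, 0.5, -0.8]   # one 8-d representation
views = matryoshka_views(full, dims=[2, 4, 8])

# The coarse view is literally contained in the finer views -- like nesting dolls.
assert views[2] == views[4][:2] == views[8][:2]
```

A downstream task can then pick the smallest view that meets its accuracy target, trading detail for efficiency.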


Problem Definition & Constraints

Core Problem Formulation & The Dilemma

The fundamental problem addressed by this paper lies within the domain of Model Heterogeneous Federated Learning (MHeteroFL).

Input/Current State:
In traditional Federated Learning (FL), a central server coordinates multiple clients to collaboratively train a single, global shared model. Clients train this model on their local data and send updated parameters to the server for aggregation. However, this paradigm faces significant challenges when clients possess heterogeneous local models, diverse system resources, and non-Independent and Identically Distributed (non-IID) local data. Existing MHeteroFL approaches attempt to address model heterogeneity by allowing clients to train models with tailored structures. The current state of these methods primarily relies on transferring knowledge between client and server models through training loss.

Desired Endpoint (Output/Goal State):
The paper aims to develop a novel MHeteroFL approach, termed Federated Model Heterogeneous Matryoshka Representation Learning (FedMRL), for supervised learning tasks. The desired outcome is a system that can effectively facilitate knowledge transfer between heterogeneous client models and a homogeneous global model, leading to superior model accuracy, faster convergence, and reduced communication and computational costs, all while strictly preserving data privacy and accommodating diverse client model structures and data distributions. The ultimate goal is for each client to use its local combined model for inference after FL training.

Missing Link or Mathematical Gap:
The critical missing link in existing MHeteroFL methods is their limited capability for knowledge exchange. Relying solely on training loss for knowledge transfer often leads to performance bottlenecks, high communication and computation costs, and risks exposing private local model structures and sensitive local data. The paper attempts to bridge this gap by introducing two key innovations:
1. Adaptive Representation Fusion: Instead of just loss, FedMRL fuses generalized representations (extracted by the global homogeneous model's feature extractor) and personalized representations (extracted by the client's heterogeneous local model's feature extractor). These are then mapped to a unified, fused representation by a personalized lightweight representation projector, adapting to local non-IID data.
2. Multi-Granularity Representation Learning: The fused representation is used to construct Matryoshka Representations, which involve multi-dimensional and multi-granular embedded representations. These are processed by both the global homogeneous model header and the local heterogeneous model header, with their combined losses used to update all models. This multi-perspective learning enhances knowledge interaction.

Mathematically, the paper seeks to minimize the following objective function across all clients:
$$ \min_{\theta, \omega_0, \dots, \omega_{N-1}, \phi_0, \dots, \phi_{N-1}} \sum_{k=0}^{N-1} l(W_k(D_k; (\theta \circ \omega_k | \phi_k))) $$
where $W_k$ represents the combined model for client $k$, $D_k$ is client $k$'s local data, $\theta$ denotes the parameters of the global homogeneous small model, $\omega_k$ represents the parameters of client $k$'s local heterogeneous model, and $\phi_k$ represents the parameters of client $k$'s personalized representation projector. This objective is optimized via gradient descent for all these parameter sets.
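To make the symbols concrete, here is a toy rendering of the objective in code. Every function is a hypothetical stand-in (scalar parameters, squared error instead of the paper's loss); only the structure — a sum of per-client losses over $(\theta, \omega_k, \phi_k)$ — mirrors the equation:

```python
# Toy sketch of the FedMRL objective: sum of per-client losses over the
# combined model (theta, omega_k, phi_k). All functions are illustrative
# stand-ins, not the paper's implementation.
def combined_model_loss(theta, omega_k, phi_k, data_k):
    # Stand-in for l(W_k(D_k; (theta . omega_k | phi_k))): a squared error
    # between a toy linear "prediction" and the label.
    loss = 0.0
    for x, y in data_k:
        pred = theta * x + omega_k * x + phi_k   # placeholder for W_k
        loss += (pred - y) ** 2
    return loss

def global_objective(theta, omegas, phis, datasets):
    # sum_{k=0}^{N-1} l(W_k(D_k; (theta . omega_k | phi_k)))
    return sum(
        combined_model_loss(theta, omegas[k], phis[k], datasets[k])
        for k in range(len(datasets))
    )

datasets = [[(1.0, 2.0)], [(2.0, 3.0)]]          # two clients, one sample each
obj = global_objective(0.5, [0.5, 0.3], [0.1, 0.0], datasets)
```

Gradient descent on `obj` with respect to all three parameter groups is exactly the minimization the master equation describes.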

The Dilemma:
The core dilemma that has trapped previous researchers is the painful trade-off between effective knowledge transfer and model performance versus privacy preservation, communication efficiency, and computational feasibility in heterogeneous FL environments. Improving knowledge transfer often necessitates sharing more information (e.g., intermediate features, model structures), which can compromise privacy, increase communication bandwidth requirements, and demand more computational resources. Conversely, strict privacy and resource constraints limit the depth and richness of knowledge that can be exchanged, leading to suboptimal model performance, especially when dealing with highly diverse client models and data. The challenge is to achieve robust knowledge sharing without breaking these critical constraints.

Constraints & Failure Modes

The problem of model heterogeneous federated learning is exceptionally difficult due to several hard, realistic constraints that the authors confront:

  1. Data Heterogeneity (Non-IID Data): Clients' local datasets are often non-IID, meaning their data distributions are different. A global model trained by aggregating updates from such diverse local data may perform poorly on individual clients or generalize poorly across the network. This makes achieving a universally performant model extremely challenging.
  2. System Heterogeneity: FL clients possess diverse computational capabilities (e.g., CPU/GPU, memory) and network bandwidth. A solution must be adaptable to these varying resources. Forcing a large, uniform model structure on all clients means the model size must accommodate the weakest device, leading to underutilization of resources on more powerful clients and suboptimal performance.
  3. Model Heterogeneity & Intellectual Property (IP) Concerns: Clients, particularly enterprises, may have proprietary local models with distinct architectures and parameters that cannot be directly shared with others due to IP protection. This constraint prevents direct model parameter averaging, a common operation in traditional FL.
  4. Limited Knowledge Transfer Mechanisms: Existing MHeteroFL methods primarily rely on training loss for knowledge transfer, which is often insufficient for robust learning across highly heterogeneous models. This limited knowledge exchange leads to performance bottlenecks and slower convergence.
  5. Communication Cost Limits: In FL, only model parameters are transmitted between the server and clients, not raw data, to preserve privacy. However, even model parameters can be large. Solutions must incur low communication costs per round and achieve target accuracy in fewer rounds to be practical, especially for edge devices with limited bandwidth.
  6. Computational Overhead Limits: Clients, especially mobile or edge devices, have limited computational resources. Any additional components or training steps introduced by a MHeteroFL solution must incur low extra computational costs per client per round to be feasible.
  7. Privacy Preservation Requirements: A core tenet of FL is that local data remains on client devices. Furthermore, the client's local model structures and parameters should not be exposed to the server or other clients. Any knowledge transfer mechanism must uphold these strict privacy guarantees.
  8. Non-convex Optimization: The objective function for federated learning, especially with heterogeneous models and complex representation learning, is typically non-convex. Guaranteeing convergence and achieving good local optima is a significant mathematical challenge, requiring careful design of optimization strategies and theoretical analysis. The paper provides a theoretical analysis for an $O(1/T)$ non-convex convergence rate.
  9. Model Agnostic Client Onboarding: The system should be flexible enough to allow new clients with diverse, potentially unknown, local model structures to join the federated learning process seamlessly. This requires adaptive mechanisms that do not assume prior knowledge of client model architectures.

Why This Approach

The Inevitability of the Choice

The adoption of Federated Model Heterogeneous Matryoshka Representation Learning (FedMRL) was not merely a preference but a necessary evolution driven by the inherent limitations of prior approaches in model heterogeneous federated learning (MHeteroFL). The authors recognized that traditional "SOTA" methods, even when adapted for federated settings, were fundamentally insufficient to simultaneously address the multifaceted challenges of data, system, and model heterogeneity while maintaining privacy and efficiency.

Specifically, the paper highlights that existing MHeteroFL methods primarily rely on training loss to transfer knowledge between client and server models. This design choice proved to be a bottleneck, leading to limited knowledge exchange, high communication and computation costs, and an unacceptable risk of exposing private local model structures and data. The realization that these methods were inadequate stemmed from their inability to:
1. Effectively transfer rich knowledge: Simple loss-based knowledge transfer proved insufficient for complex heterogeneous model structures and diverse local data distributions.
2. Manage high communication and computation overhead: Transmitting entire model parameters or relying on computationally expensive distillation techniques was unsustainable.
3. Preserve privacy of proprietary models: Many existing methods required exposing parts of the local model structure, which is a non-starter for enterprise clients concerned with intellectual property.

The inspiration from Matryoshka Representation Learning (MRL) [24] provided the critical insight: tailoring representation dimensions to achieve an optimal trade-off between model performance and inference costs. This concept, when integrated into MHeteroFL, offered a pathway to overcome the aforementioned limitations, making FedMRL the only viable solution that could robustly handle the complexities of real-world heterogeneous federated environments.

Comparative Superiority

FedMRL demonstrates qualitative superiority over previous gold standards through several structural advantages that go beyond mere performance metrics. While it achieves significant accuracy improvements (up to 8.48% over the best baseline and 24.94% over the best same-category baseline), its true strength lies in its design innovations:

  1. Adaptive Representation Fusion: Unlike methods that rely on fixed knowledge transfer mechanisms, FedMRL introduces a personalized lightweight representation projector. This projector dynamically adapts to local non-IID data distributions, fusing generalized representations from the global homogeneous model with personalized representations from the local heterogeneous model. This adaptive fusion ensures that knowledge transfer is highly relevant and effective for each client's unique data, a structural advantage that significantly enhances model learning capability in diverse data environments.
  2. Multi-Granularity Representation Learning: Inspired by MRL, FedMRL constructs Matryoshka Representations with multi-dimensional and multi-granular embedded representations. This allows for multi-perspective representation learning, meaning the model can capture both coarse and fine-grained features. This structural depth enables a richer and more robust understanding of data, making the model more resilient to variations and noise inherent in heterogeneous federated settings. It's not about handling high-dimensional noise better in the traditional sense, but rather extracting more informative and adaptable representations across different granularities.
  3. Optimized Resource Trade-offs: The ability to vary the representation dimension ($d_1$) of the small homogeneous global model relative to the local model's dimension ($d_2$) provides a crucial knob for optimizing the trade-off between model performance, storage requirements, and communication costs. This flexibility is a significant structural advantage, allowing the system to be tailored to diverse client capabilities without sacrificing overall effectiveness. For instance, a smaller $d_1$ can drastically reduce communication overhead without a proportional drop in accuracy, as shown in the sensitivity analysis (Figure 6, left two).

These innovations collectively provide a structural advantage that allows FedMRL to achieve superior model accuracy with lower communication and computational costs, while also offering stronger personalization capabilities for individual clients, as evidenced by the individual client test accuracy differences (Figure 3, right two).
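The $d_1$ knob's effect on communication cost can be estimated with a back-of-envelope parameter count. The two-layer architecture and sizes below are my own illustrative assumptions, not the paper's models:

```python
# Back-of-envelope sketch: how the global model's representation dimension d1
# drives per-round communication cost. Layer sizes are illustrative
# assumptions, not the paper's architectures.
def small_model_params(input_dim, d1, num_classes):
    # Feature extractor (input_dim -> d1) plus header (d1 -> num_classes),
    # counting weights and biases of two dense layers.
    extractor = input_dim * d1 + d1
    header = d1 * num_classes + num_classes
    return extractor + header

full = small_model_params(input_dim=3072, d1=512, num_classes=10)
slim = small_model_params(input_dim=3072, d1=128, num_classes=10)
# Shrinking d1 by 4x cuts the parameters uploaded per round by roughly 4x,
# because the extractor's input_dim * d1 term dominates the count.
```

Since only the small homogeneous model crosses the network, this count is (approximately) the per-client upload per round.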

Alignment with Constraints

FedMRL's design perfectly aligns with the harsh requirements of model heterogeneous federated learning, forming a "marriage" between problem and solution:

  • Data Heterogeneity (non-IID data): The Adaptive Representation Fusion mechanism, with its personalized representation projector, is explicitly designed to adapt to local non-IID data distributions. By fusing generalized and personalized features in a data-aware manner, FedMRL directly tackles the challenge of clients having statistically different datasets.
  • System Heterogeneity: The introduction of an auxiliary small homogeneous model that interacts with heterogeneous local models is key. The global model's size can be kept small (by varying $d_1$), accommodating clients with limited computing power or network bandwidth. Clients can also tailor their local models to their specific system resources, as the framework is model-agnostic for the local heterogeneous model.
  • Model Heterogeneity: FedMRL treats each client's local model as a "black box." The server only broadcasts and aggregates the small homogeneous model, not the heterogeneous local models. This ensures that proprietary model structures of clients are never exposed, directly addressing the intellectual property concerns.
  • Privacy Preservation: This is a direct consequence of the model heterogeneity solution. Since only the small homogeneous model parameters are exchanged, local data and the full structure of client-specific heterogeneous models remain private on the client side.
  • Communication and Computation Costs: By exchanging only the small homogeneous model, FedMRL significantly reduces the number of parameters transmitted per round compared to methods that exchange full local models. Furthermore, the enhanced knowledge transfer through adaptive fusion and multi-granularity learning leads to faster model convergence (fewer communication rounds overall), which ultimately reduces the total communication and computational overhead, despite a slight increase in per-round computation due to the auxiliary model.

Rejection of Alternatives

The paper implicitly and explicitly rejects several alternative MHeteroFL approaches by highlighting their fundamental shortcomings that FedMRL aims to overcome.

  • MHeteroFL with Adaptive Subnets: These methods construct local subnets by pruning or designing global model parameters. The paper notes their failure when clients possess "black-box local models with heterogeneous structures not derived from a common global model," as the server cannot aggregate them. This limitation is critical for scenarios where clients have truly proprietary and diverse model architectures, which FedMRL accommodates by treating local models as black boxes.
  • MHeteroFL with Knowledge Distillation: While popular, these methods often "rely on a public dataset with the same data distribution as the learning task." The authors point out that "in practice, such a suitable public dataset can be hard to find." Alternatives involving training a generator to synthesize shared data are dismissed due to "high training costs." FedMRL avoids these issues by directly fusing representations without needing a public dataset or expensive data generation.
  • MHeteroFL with Model Split: Approaches that split models into feature extractors and predictors (e.g., sharing homogeneous feature extractors or personalized predictors) are rejected because they "expose part of the local model structures," which is "not acceptable if the models are proprietary IPs of the clients." FedMRL's design ensures local model structures remain entirely private.
  • MHeteroFL with Mutual Learning: FedMRL is presented as an optimization of this category. Existing mutual learning methods (like FML [41] or FedKD [45]) "add a shared global homogeneous small model on top of each client's heterogeneous local model" and use mutual loss for updates. However, the paper states that "the mutual loss only transfers limited knowledge between the two models, resulting in model performance bottlenecks." FedMRL addresses this by enhancing knowledge transfer through adaptive representation fusion and multi-granularity learning, thereby overcoming the core limitation of its closest predecessors.

The paper does not discuss generative models like GANs or Diffusion models as direct alternatives, as their primary function (generating data) is distinct from the representation learning and classification task at hand in MHeteroFL. The focus is on improving knowledge transfer and handling heterogeneity within a discriminative federated learning context.

Figure 7. Accuracy of four optional inference models: mix-small (the whole model without the local header), mix-large (the whole model without the global header), single-small (the homogeneous small model), single-large (the client heterogeneous model)

Mathematical & Logical Mechanism

The Master Equation

The absolute core equation that drives the Federated Model Heterogeneous Matryoshka Representation Learning (FedMRL) approach is its objective function, which aims to minimize the total loss across all participating clients. This master equation, found in Section 3, is presented as:

$$ \min_{\theta, \omega_0, \dots, \omega_{N-1}, \phi_0, \dots, \phi_{N-1}} \sum_{k=0}^{N-1} l(W_k(D_k; (\theta \circ \omega_k | \phi_k))) $$

Term-by-Term Autopsy

Let's dissect this equation piece by piece to understand its full meaning and role within the FedMRL framework.

  • $\min_{\theta, \omega_0, \dots, \omega_{N-1}, \phi_0, \dots, \phi_{N-1}}$: This is the minimization operator.

    • Mathematical Definition: It indicates that the goal is to find the specific values for the parameters $\theta$, $\omega_k$ (for all $k$), and $\phi_k$ (for all $k$) that result in the smallest possible value of the objective function (the sum of losses).
    • Physical/Logical Role: This is the very heart of the learning process. It signifies that the system is trying to find the "best" set of models and projectors that minimize prediction errors across the entire federated network.
    • Why Used: Minimization is a fundamental concept in machine learning, as models are typically trained by reducing a defined error metric.
  • $\sum_{k=0}^{N-1}$: This denotes a summation over all $N$ clients.

    • Mathematical Definition: It sums up the loss contributions from each individual client, from client 0 to client $N-1$.
    • Physical/Logical Role: In a federated learning setting, the overall performance is a collective measure. This summation ensures that the global optimization objective considers the performance and contributions of every single client, fostering collaborative learning.
    • Why Used: To aggregate the local learning objectives into a single global objective, reflecting the distributed nature of federated learning where no single client's loss is optimized in isolation.
  • $l(\cdot)$: This represents the loss function.

    • Mathematical Definition: A mathematical function that quantifies the difference or error between the model's predicted output and the actual true label. The paper mentions cross-entropy loss [63] as a typical example.
    • Physical/Logical Role: It acts as a feedback mechanism, telling the model how "wrong" its predictions are. A higher loss means worse performance, prompting the model to adjust its parameters during training.
    • Why Used: Cross-entropy loss is a standard and effective choice for classification tasks, which is the primary application context for FedMRL in this paper.
  • $W_k(\cdot)$: This is the combined model for client $k$.

    • Mathematical Definition: It's a composite function representing the entire processing pipeline for client $k$. As described in the paper, $W_k = (G(\theta) \circ F_k(\omega_k) \,|\, P_k(\phi_k))$, where $G$, $F_k$, and $P_k$ are the global homogeneous model, client $k$'s local heterogeneous model, and client $k$'s representation projector, respectively.
    • Physical/Logical Role: This is the actual "engine" at client $k$ that takes raw data, processes it through both shared global and local personalized components, and ultimately generates a prediction. Its output is what the loss function evaluates.
    • Why Used: It encapsulates the unique architecture of FedMRL, where each client's prediction is a result of interacting global, local, and personalized fusion mechanisms.
  • $D_k$: This refers to the local non-IID data for client $k$.

    • Mathematical Definition: A dataset containing input-label pairs $(x_i, y_i)$ that are exclusively available to client $k$. This data is often non-independent and identically distributed (non-IID), meaning its statistical properties can differ significantly from other clients' data.
    • Physical/Logical Role: This is the private, local information that client $k$ uses to train its model. It reflects the real-world challenge of data heterogeneity in federated learning.
    • Why Used: Federated learning's core principle is to train models on decentralized data without sharing the raw data itself. Thus, each client's objective is evaluated on its local data.
  • $(\theta \circ \omega_k | \phi_k)$: This represents the collective parameters that define the behavior of client $k$'s combined model $W_k$.

    • Mathematical Definition: It's a conceptual grouping of the global model parameters $\theta$, client $k$'s local model parameters $\omega_k$, and client $k$'s personalized representation projector parameters $\phi_k$. The $\circ$ symbol typically denotes function composition (e.g., feature extractors), while the $|$ symbol here indicates the inclusion of the projector parameters in the overall model definition.
    • Physical/Logical Role: These are the knobs and dials that the optimization process adjusts. $\theta$ governs the shared knowledge, $\omega_k$ handles client-specific personalization, and $\phi_k$ fine-tunes how these two types of knowledge are blended for client $k$.
    • Why Used: To explicitly denote all the trainable parameters that contribute to the prediction for client $k$ and are subject to the minimization process.
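Since the autopsy names cross-entropy as the typical choice for $l(\cdot)$, a tiny worked example shows the feedback behavior it provides (a generic illustration, not the paper's code):

```python
# Minimal cross-entropy example, matching the role of l(.) described above:
# the loss is the negative log-probability assigned to the true class.
import math

def cross_entropy(probs, true_idx):
    return -math.log(probs[true_idx])

confident = cross_entropy([0.05, 0.90, 0.05], true_idx=1)   # good prediction
uncertain = cross_entropy([0.40, 0.30, 0.30], true_idx=1)   # poor prediction
# The worse the prediction, the larger the loss signal fed back to training.
```

This monotone "more wrong, more loss" behavior is what makes it an effective feedback mechanism for the minimization above.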

Step-by-Step Flow

Imagine a single data point, say an image $x_i$ with its true label $y_i$, entering client $k$'s system. Here's how it moves through the FedMRL mechanism:

  1. Dual Feature Extraction: First, the input image $x_i$ is simultaneously fed into two distinct feature extractors.

    • It goes into the global homogeneous model's feature extractor, $G^{ex}$, which is a component of the shared global model. This extracts a generalized representation $R_k^g$. Think of this as capturing common, broadly applicable features.
    • At the same time, $x_i$ enters client $k$'s local heterogeneous model's feature extractor, $F_k^{ex}$. This extracts a personalized representation $R_k^f$, which is tailored to client $k$'s specific data characteristics and model structure. This is like getting a specialized view.
  2. Representation Splicing: Next, these two distinct representations, $R_k^g$ and $R_k^f$, are "spliced" together. This is typically a concatenation operation, forming a longer combined representation $R_i$. This step is crucial because it preserves the individual semantic information from both the generalized and personalized views before further processing.

  3. Adaptive Representation Fusion: The spliced representation $R_i$ then passes through client $k$'s personalized lightweight representation projector, $P_k$. This projector maps the spliced representation into a fused representation $\tilde{R}_i$. This projector is adaptive, meaning it learns how best to combine the generalized and personalized features specifically for client $k$'s local data distribution, acting like a smart mixer.

  4. Matryoshka Representation Construction: From this single fused representation $\tilde{R}_i$, two "Matryoshka" (nested) representations are derived.

    • A low-dimension coarse-granularity representation $R_i^{lc}$ is extracted. This is like taking a broad, summary view of the fused features.
    • A high-dimension fine-granularity representation $R_i^{hf}$ is also extracted. This captures more detailed aspects of the fused features, potentially encompassing the coarse view.
  5. Dual Prediction Headers: These two Matryoshka representations are then sent to their respective prediction heads:

    • $R_i^{lc}$ goes to the global homogeneous model's prediction header, $G^{hd}$, which makes a coarse prediction $\hat{y}_i^{lc}$.
    • $R_i^{hf}$ goes to client $k$'s local heterogeneous model's prediction header, $F_k^{hd}$, which makes a fine prediction $\hat{y}_i^{F_k}$.
  6. Loss Calculation and Summation: Finally, the system calculates the individual losses for both predictions against the true label $y_i$. These are $l_i^{lc}$ and $l_i^{F_k}$. These two losses are then weighted (by default, equally) and summed up to produce a single total loss $l_i$ for the input data point. This total loss is the ultimate signal that guides the learning process.
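The six steps above can be sketched end-to-end for a single scalar input. All extractors, the projector, and the heads are toy linear stand-ins with made-up weights and dimensions; only the data flow mirrors FedMRL:

```python
# End-to-end sketch of the FedMRL forward pass for one input, following the
# six steps above. All components are toy stand-ins, not the paper's code.
def forward(x, y):
    # 1. Dual feature extraction (toy 2-d global / 3-d local extractors).
    r_g = [x * w for w in (0.2, 0.5)]          # generalized rep R_k^g
    r_f = [x * w for w in (0.1, 0.4, 0.3)]     # personalized rep R_k^f

    # 2. Splicing: concatenate the two representations.
    r = r_g + r_f

    # 3. Adaptive fusion: lightweight projector (here, elementwise scaling).
    fused = [0.8 * v for v in r]

    # 4. Matryoshka construction: nested coarse / fine views.
    r_lc = fused[:2]       # low-dimension, coarse granularity
    r_hf = fused[:5]       # high-dimension, fine granularity (contains r_lc)

    # 5. Dual headers: toy scalar predictions from each view.
    y_lc = sum(r_lc)
    y_hf = sum(r_hf)

    # 6. Equal-weighted sum of both losses (squared error as a stand-in).
    return 0.5 * (y_lc - y) ** 2 + 0.5 * (y_hf - y) ** 2

loss = forward(x=1.0, y=1.0)
```

Backpropagating this single `loss` through both heads, the projector, and both extractors is what couples the global and local models during training.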

Optimization Dynamics

The FedMRL mechanism learns and converges through an iterative process that combines local client-side training with server-side aggregation. It's a dance between personalization and generalization.

  1. Local Learning and Gradient Descent: In each communication round, a subset of clients is selected. Each selected client $k$ receives the current global homogeneous model parameters ($\theta$) from the server. Then, for multiple local training epochs, client $k$ processes its private local data $D_k$. For every data point $(x_i, y_i)$, the "Step-by-Step Flow" described above is executed to compute the total loss $l_i$. This loss is then used to calculate gradients for all the parameters involved in client $k$'s combined model: the global model parameters ($\theta$), its local heterogeneous model parameters ($\omega_k$), and its personalized representation projector parameters ($\phi_k$). These parameters are updated using gradient descent:
    $$ \theta^t \leftarrow \theta^{t-1} - \eta_\theta \nabla l_i, \quad \omega_k^t \leftarrow \omega_k^{t-1} - \eta_\omega \nabla l_i, \quad \phi_k^t \leftarrow \phi_k^{t-1} - \eta_\phi \nabla l_i $$
    The learning rates $\eta_\theta, \eta_\omega, \eta_\phi$ control the step size of these updates. The paper sets them equal by default, which simplifies tuning and promotes stable convergence. This local training allows each client to adapt the shared global knowledge and personalize its local model and projector to its unique data.

  2. Selective Parameter Upload: After completing its local training epochs, client $k$ only uploads its updated global homogeneous small model parameters ($\theta^t$) back to the central server. Critically, the client's local heterogeneous model parameters ($\omega_k$) and the personalized projector parameters ($\phi_k$) remain on the client, ensuring data privacy and reducing communication overhead. This selective sharing is a key design choice.

  3. Server-Side Aggregation: The central server collects the updated global homogeneous model parameters from all participating clients. It then aggregates these parameters, typically by averaging them (similar to Federated Averaging), to produce a new, improved global homogeneous model $\theta^{t+1}$. This aggregation step synthesizes the shared knowledge learned across all clients.

  4. Global Model Broadcast: The newly aggregated global model $\theta^{t+1}$ is then broadcast back to all clients for the next communication round. This completes one full cycle of federated learning.

  5. Convergence Behavior: This iterative process continues until the models converge. The paper provides a theoretical analysis demonstrating an $O(1/T)$ non-convex convergence rate, where $T$ is the number of communication rounds. This means that as more rounds of training occur, the overall loss is expected to decrease, and the model's performance improves. The loss landscape is shaped by the complex interplay of generalized and personalized representations. The multi-granularity Matryoshka representations help the model explore this landscape from different perspectives, facilitating better learning and convergence by allowing for both coarse and fine-grained adjustments. The adaptive representation fusion further refines this by tailoring the knowledge blend to each client's specific data, making the optimization more robust to data heterogeneity.
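The round structure in steps 1–4 can be sketched end-to-end. This is an illustrative simplification, not the paper's implementation: the models are toy stand-ins (not CNN-1 to CNN-5), and the personalized projector and representation fusion are omitted so the selective-upload and averaging logic stand out.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def local_round(global_model, local_model, data, lr=0.01, epochs=1):
    """One client's local training: jointly update the shared model (theta)
    and the private heterogeneous model (omega_k), then upload only theta."""
    opt = torch.optim.SGD(
        list(global_model.parameters()) + list(local_model.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y in data:
            # Simplified total loss: both predictions scored against the label.
            loss = F.cross_entropy(global_model(x), y) + F.cross_entropy(local_model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return copy.deepcopy(global_model.state_dict())  # omega_k never leaves the client

def fedavg(uploads):
    """Server-side aggregation: element-wise mean of the uploaded thetas."""
    avg = copy.deepcopy(uploads[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in uploads]).mean(dim=0)
    return avg

# Two clients with heterogeneous private models but one shared global model.
global_model = nn.Linear(4, 3)
clients = [nn.Linear(4, 3),
           nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))]
data = [(torch.randn(6, 4), torch.randint(0, 3, (6,)))]

uploads = [local_round(copy.deepcopy(global_model), local_model, data)
           for local_model in clients]
global_model.load_state_dict(fedavg(uploads))  # broadcast for the next round
```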

Figure 2. The workflow of FedMRL

Results, Limitations & Conclusion

Experimental Design & Baselines

To rigorously validate FedMRL's mathematical claims and practical efficacy, the authors designed a comprehensive experimental setup. They implemented FedMRL using PyTorch and benchmarked it against seven state-of-the-art Model Heterogeneous Federated Learning (MHeteroFL) methods. All experiments were conducted on four NVIDIA GeForce RTX 3090 GPUs, each with 24GB of memory.

The baselines against which FedMRL was compared fall into four distinct categories of MHeteroFL approaches:
1. Standalone: Each client trains its model in isolation, representing the lower bound of collaborative learning benefits.
2. Knowledge Distillation Without Public Data: This category included FD [21] and FedProto [43], which transfer knowledge by sharing intermediate information or prototypes without relying on a public dataset.
3. Model Split: Represented by LG-FedAvg [27], these methods split models into feature extractors and predictors, sharing some components while personalizing others.
4. Mutual Learning: This group comprised FML [41], FedKD [45], and FedAPEN [37], which typically add a shared global homogeneous small model and use mutual loss to update parameters. FedMRL directly builds upon and aims to improve this category.

Two widely-used benchmark datasets for image classification in FL were employed: CIFAR-10 (10 classes) and CIFAR-100 (100 classes), both consisting of 60,000 32x32 color images. To simulate real-world data heterogeneity, two types of non-IID (non-independent and identically distributed) data partitions were constructed:
- Non-IID (Class): Clients were assigned a limited number of classes (e.g., 2 for CIFAR-10, 10 for CIFAR-100), with fewer classes indicating higher non-IIDness.
- Non-IID (Dirichlet): A Dirichlet($\alpha$) distribution was used to control the data distribution skew, where a smaller $\alpha$ value signified more pronounced non-IIDness.
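The Dirichlet-based partition described above can be sketched as follows. This is a common construction in the FL literature, shown here with toy labels rather than the actual CIFAR datasets; the function name and seed handling are illustrative choices.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions.

    A smaller alpha concentrates each class on fewer clients, i.e. more
    pronounced non-IIDness; a large alpha approaches an IID split.
    """
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw this class's split proportions from Dirichlet(alpha).
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

labels = np.repeat(np.arange(10), 100)  # toy stand-in for CIFAR-10 labels
parts = dirichlet_partition(labels, num_clients=5, alpha=0.3)
```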

The evaluation covered both model-homogeneous (all clients use CNN-1) and model-heterogeneous (clients use a mix of CNN-1 to CNN-5 models) FL scenarios. FedMRL's core mechanism, involving an auxiliary homogeneous small model and a personalized representation projector, was tested with its unique hyperparameter $d_1$ (representation dimension of the homogeneous small model) varied from 100 to 500 to find optimal performance. The authors meticulously searched for optimal FL hyperparameters across all algorithms, including batch size, number of local epochs, communication rounds, and learning rates, to ensure a fair comparison.

The primary evaluation metrics were:
- Model Accuracy: The average test accuracy across all clients' models.
- Communication Cost: Measured by the total number of parameters exchanged between server and client to reach a target accuracy, considering both parameters per round and the number of rounds.
- Computation Overhead: Measured by the total FLOPs (floating-point operations) performed by a client to reach a target accuracy, accounting for FLOPs per round and the number of rounds.

What the Evidence Proves

The experimental evidence provides strong support for the claim that FedMRL's core mechanism—adaptive personalized representation fusion and multi-granularity representation learning—significantly enhances performance in heterogeneous federated learning environments.

Superior Accuracy:
- Overall Outperformance: Across all tested FL settings (varying client numbers N and participation rates C) and both model-homogeneous (Appendix C.2, Table 3) and model-heterogeneous (Table 1) scenarios, FedMRL consistently achieved higher average test accuracy than all baselines.
- Quantifiable Gains: FedMRL demonstrated an impressive improvement of up to 8.48% in average test accuracy compared to the overall best-performing baseline. More strikingly, it achieved up to a 24.94% improvement over the best baseline within its own category (mutual learning-based MHeteroFL methods). This substantial margin clearly indicates that FedMRL's approach to knowledge transfer is far more effective than previous mutual learning strategies that relied solely on training loss.
- Faster Convergence: Figure 3 (left six plots) visually confirms that FedMRL not only reaches higher accuracy but also converges faster than the best baseline (FedProto), indicating more efficient learning.

Enhanced Personalization:
- Individual Client Benefits: Figure 3 (right two plots) provides compelling evidence of FedMRL's strong personalization capability. When compared to FedProto, FedMRL enabled 87% of clients on CIFAR-10 and a remarkable 99% of clients on CIFAR-100 to achieve better individual test accuracy. This directly validates the effectiveness of the personalized representation projector and multi-granularity learning in adapting to diverse local data distributions and model structures.

Improved Efficiency:
- Reduced Communication Rounds: Figure 4 (left) shows that FedMRL requires fewer communication rounds to reach target accuracy levels (90% for CIFAR-10, 50% for CIFAR-100) compared to FedProto, implying faster overall training.
- Lower Total Computation: Despite the per-round overhead of training an additional small homogeneous model and a lightweight projector, Figure 4 (right) demonstrates that FedMRL incurs lower total computation costs than FedProto. This is a direct consequence of its faster convergence, which outweighs the slightly increased per-round computational burden.
- Communication Cost Trade-off: While FedMRL's communication cost per round is higher than FedProto (due to transmitting the full homogeneous small model), the paper argues that with an optional smaller representation dimension ($d_1$), it still achieves higher communication efficiency than other mutual learning-based MHeteroFL baselines that use larger representation dimensions. This suggests a strategic trade-off that can be optimized.

Robustness to Heterogeneity:
- Non-IID Data Robustness: The case studies (Figure 5) demonstrate FedMRL's robustness to various degrees of non-IIDness, both class-based and Dirichlet-based. FedMRL consistently maintained higher average test accuracy than FedProto across all non-IID settings, showing its ability to handle diverse data distributions effectively.

Ablation Study Validation:
- Matryoshka Representation Learning's Impact: The ablation study (Figure 6, right two plots) provides critical evidence for the utility of the Matryoshka Representation Learning (MRL) component. FedMRL with MRL consistently outperformed FedMRL without MRL, confirming that the multi-granularity representation learning design is a vital contributor to the overall performance gains in MHeteroFL. The observation that the accuracy gap diminishes as $d_1$ rises also offers insight into the mechanism, suggesting that the benefits of MRL are most pronounced when representations are less overlapping.
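The multi-granularity loss that the MRL ablation isolates can be sketched as a sum of cross-entropies over nested prefixes of the representation, each scored by its own header. The dimensions below are illustrative, not the paper's $d_1$ settings; the MRL-E variant discussed later would instead share a single header across granularities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d, num_classes = 64, 10
nested_dims = [8, 16, 32, 64]  # each prefix is one "doll" in the Matryoshka
heads = nn.ModuleList(nn.Linear(m, num_classes) for m in nested_dims)

def matryoshka_loss(z, y):
    """Sum of cross-entropies over nested prefix representations z[:, :m].

    Coarse (small-m) and fine (full-d) granularities are trained jointly,
    so the first m dimensions of z stay useful on their own.
    """
    return sum(F.cross_entropy(head(z[:, :m]), y)
               for m, head in zip(nested_dims, heads))

z = torch.randn(4, d)
y = torch.randint(0, num_classes, (4,))
loss = matryoshka_loss(z, y)
```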

In essence, the evidence shows that FedMRL's dual innovations—adaptive representation fusion and multi-granularity representation learning—work in concert to deliver a powerful, efficient, and robust solution for model-heterogeneous federated learning, outperforming state-of-the-art baselines across multiple critical metrics.

Limitations & Future Directions

While FedMRL presents a significant advancement in model-heterogeneous federated learning, the authors candidly acknowledge certain limitations and propose clear avenues for future research.

Current Limitations:
1. Increased Resource Consumption for Global Header: The current design involves processing multi-granularity embedded representations through both the global small model's header and the local client model's header. Although the global header is a relatively simple linear layer, this dual processing inherently increases the storage cost, communication costs, and training overhead associated with the global header. This is a practical concern, especially in resource-constrained FL environments where every byte and FLOP counts.
2. Lack of Statistical Significance Reporting: The paper mentions conducting only three trials for each experimental setting and reporting average results. This approach, while common, does not include error bars, confidence intervals, or statistical significance tests. Consequently, it's difficult to ascertain the statistical robustness of the reported improvements and whether the observed differences are truly significant or merely due to random variation across runs. This is a minor but important omission for full scientific rigor.

Future Directions and Discussion Topics:

The identified limitations naturally lead to several promising directions for further development and evolution of these findings, stimulating critical thinking:

  1. Optimizing Global Header Usage (MRL-E Integration): The authors explicitly suggest adopting the more effective Matryoshka Representation Learning method (MRL-E) [24] in future work. This involves removing the global header entirely and relying solely on the local model header to process multi-granularity Matryoshka Representations. This would directly address the current limitation of increased resource consumption for the global header, potentially leading to a better trade-off between model performance and the costs of storage, communication, and computation. A key discussion point here is how to ensure sufficient knowledge transfer and generalization capability from the homogeneous model if its header is completely removed. Would this necessitate a more sophisticated fusion mechanism or a different aggregation strategy for the homogeneous model's feature extractor?

  2. Dynamic Representation Dimension Adaptation: The sensitivity analysis on $d_1$ (the representation dimension of the homogeneous small model) showed that smaller $d_1$ values often lead to higher accuracy and reduced overheads. This suggests that $d_1$ is a crucial hyperparameter for balancing performance and efficiency. Future work could explore dynamic, adaptive mechanisms to determine $d_1$ (and potentially $d_2$) during training, perhaps based on client-specific resource constraints or data characteristics. Could an online learning approach or a meta-learning framework be used to optimize these dimensions without manual tuning?

  3. Beyond Supervised Learning: The current FedMRL approach is tailored for supervised learning tasks. Extending it to other learning paradigms, such as semi-supervised, unsupervised, or reinforcement learning in a federated heterogeneous setting, would be a significant step. How would the concepts of adaptive representation fusion and multi-granularity learning translate to scenarios where labels are scarce or where the objective function is not a simple cross-entropy loss?

  4. Robustness to Adversarial Attacks and Data Poisoning: While FedMRL addresses data and model heterogeneity, its robustness against adversarial attacks or data poisoning (a common concern in FL) is not explicitly evaluated. Future research could investigate how the multi-granularity representations and personalized projectors might inherently offer some resilience or how they could be augmented with specific defense mechanisms.

  5. Scalability to Extremely Large-Scale FL: The experiments were conducted with up to 100 clients. While this is a good start, real-world FL deployments can involve millions of devices. Investigating FedMRL's scalability to orders of magnitude more clients, especially concerning communication overheads and aggregation strategies, would be crucial. Are there bottlenecks in the current aggregation scheme that would become prohibitive at massive scales?

  6. Formal Statistical Significance: To bolster the scientific rigor, future work should incorporate formal statistical significance testing, including error bars and confidence intervals, for all experimental results. This would provide a clearer understanding of the reliability and generalizability of the observed performance gains.

  7. Exploration of Alternative Projector Architectures: The paper mentions that the personalized representation projector can be a one-layer linear model or a multi-layer perceptron. The current experiments likely use a simple linear model. Exploring more complex or adaptive projector architectures, perhaps ones that can dynamically adjust their complexity based on local data, could further enhance personalization and knowledge fusion.
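The two projector variants mentioned above can be sketched directly. The dimensions `d1` and `d2` and the hidden width are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Two candidate architectures for the personalized representation
# projector phi_k, mapping the homogeneous model's d1-dim representation
# into the local model's d2-dim representation space.
d1, d2 = 128, 512

# Variant 1: the one-layer linear map mentioned in the paper.
linear_projector = nn.Linear(d1, d2)

# Variant 2: a small multi-layer perceptron (hidden width is illustrative).
mlp_projector = nn.Sequential(
    nn.Linear(d1, 256),
    nn.ReLU(),
    nn.Linear(256, d2),
)

z = torch.randn(4, d1)            # a batch of homogeneous-model representations
out_linear = linear_projector(z)  # both variants produce d2-dim outputs
out_mlp = mlp_projector(z)
```

The linear variant adds almost no per-client overhead, while the MLP trades extra parameters for a more expressive fusion map; which wins likely depends on local data complexity.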

By addressing these limitations and exploring these forward-looking directions, the foundational work of FedMRL can be further refined and expanded, paving the way for even more robust and efficient heterogeneous federated learning systems.

Table 1 and Table 3 (Appendix C.2) report the results of FedMRL and the baselines in the model-heterogeneous and model-homogeneous FL scenarios, respectively; Table 2 shows the structures of the models used in the experiments.

Isomorphisms with other fields

Structural Skeleton

This paper presents a mechanism for collaboratively learning from diverse, distributed models by fusing their representations into a shared, multi-granular structure, adapting to local data distributions, and enabling efficient knowledge transfer.

Distant Cousins

  1. Target Field: Systems Biology / Multi-omics Integration

    • The Connection: In systems biology, researchers frequently encounter the challenge of integrating heterogeneous data types (e.g., genomics, proteomics, metabolomics – often termed "multi-omics") collected from various sources, such as different research labs or patient cohorts (distributed clients). These datasets inherently possess diverse structures, scales, and underlying biological contexts (heterogeneous local models). The long-standing problem is to synthesize these disparate information streams into a unified, comprehensive representation that can reveal complex biological mechanisms or predict disease outcomes. This paper's core logic, which involves fusing heterogeneous representations into a multi-granular structure, mirrors the need to integrate multi-omics data to uncover nested, hierarchical biological insights (e.g., how genetic variations influence protein expression, which in turn affects metabolic pathways). The "personalized representation projector" could be seen as an analogous component that adapts the integration process to account for patient-specific or tissue-specific biological variations and data biases, much like how FedMRL adapts to local non-IID data.
  2. Target Field: Urban Planning / Smart City Data Fusion

    • The Connection: Modern urban planning and smart city initiatives rely on integrating vast amounts of heterogeneous data from numerous sensors and systems across a city. This includes traffic flow data, public transportation usage, environmental sensor readings (air quality, noise levels), social media activity, utility consumption, and demographic information. These data sources are often managed by different municipal departments or private entities (distributed clients), each with its own data formats, collection frequencies, and inherent granularities (heterogeneous models/data). Furthermore, privacy concerns regarding citizen data are paramount. The challenge is to fuse these disparate, multi-modal data streams into a coherent, multi-granular representation to inform urban policy, predict resource demands, optimize city services, or manage emergencies. The paper's approach of creating a shared, multi-granular representation from diverse local models, while maintaining data privacy and minimizing communication, directly parallels the need to integrate urban data for holistic city management and understanding without centralizing sensitive or proprietary information.

What If Scenario

Imagine a systems biologist, grappling with the complexity of integrating multi-omics data from a consortium of hospitals, each with unique patient populations and data collection methods. If this researcher were to "steal" FedMRL's exact equations tomorrow, they could implement a federated multi-omics learning framework. Each hospital would train its local model on its specific omics data, and a central server would coordinate the fusion of these diverse representations into a shared, multi-granular Matryoshka representation. This would allow for the discovery of robust, hierarchical biomarkers for complex diseases (e.g., cancer subtypes, drug resistance mechanisms) across the entire consortium, without any hospital needing to share raw, privacy-sensitive patient data. The personalized representation projector would adapt the fused omics features to each hospital's unique patient demographics or technical biases, leading to highly accurate and generalizable predictive models. This breakthrough would accelerate personalized medicine by enabling large-scale, privacy-preserving multi-omics research, identifying subtle, nested biological patterns that are currently obscured by data heterogeneity and privacy barriers.

Universal Library of Structures

This paper enriches the "Universal Library of Structures" by demonstrating a robust pattern for decentralised, multi-modal information synthesis, where diverse local perspectives are harmonized into a shared, hierarchical understanding without compromising individual autonomy or privacy.