Research article

Federated selective aggregation for on-device knowledge amalgamation

  • Donglin Xie 1,
  • Ruonan Yu 1,
  • Gongfan Fang 1,
  • Jiaqi Han 1,
  • Jie Song 2,
  • Zunlei Feng 2,
  • Li Sun 3,
  • Mingli Song 1,*
  • 1 Key Laboratory of Visual Perception, Ministry of Education and Microsoft, Zhejiang University, Hangzhou 310000, China
  • 2 School of Software Technology, Zhejiang University, Ningbo 315000, China
  • 3 Ningbo Innovation Center, Zhejiang University, Ningbo 315000, China
*E-mail: (Mingli Song)

† These authors contributed equally to this work.

Received date: 2023-01-15

  Accepted date: 2023-05-23

  Online published: 2023-06-01

Abstract

In the current work, we explored a new knowledge amalgamation problem, termed Federated Selective Aggregation for on-device knowledge amalgamation (FedSA). FedSA aims to train an on-device student model for a new task with the help of several decentralized teachers whose pre-training tasks and data are different and agnostic. The motivation to investigate such a problem setup stems from a recent dilemma of model sharing: due to privacy, security or intellectual property issues, pre-trained models often cannot be shared, and the resources of devices are usually limited. The proposed FedSA offers a solution to this dilemma and goes one step further, as the method can be employed on low-power and resource-limited devices. To this end, a dedicated strategy was proposed to handle the knowledge amalgamation. Specifically, the student-training process was driven by a novel saliency-based approach that adaptively selects teachers as participants and integrates their representative capabilities into the student. To evaluate the effectiveness of FedSA, experiments were conducted in both single-task and multi-task settings. The experimental results demonstrate that FedSA effectively amalgamates knowledge from decentralized models and achieves performance competitive with centralized baselines.

Cite this article

Donglin Xie, Ruonan Yu, Gongfan Fang, Jiaqi Han, Jie Song, Zunlei Feng, Li Sun, Mingli Song. Federated selective aggregation for on-device knowledge amalgamation[J]. Chip, 2023, 2(3): 100053. DOI: 10.1016/j.chip.2023.100053

INTRODUCTION

The past few years have witnessed tremendous progress of deep learning in many, if not all, machine learning applications, such as computer vision1, natural language processing2, and speech recognition3. The success of deep learning is, in reality, significantly attributed to the model-sharing convention of the community: by adopting a well-behaved and publicly-verified network pre-trained by others, one may significantly reduce the retraining effort and enjoy the favorable performance resulting from thousands of GPU hours in a plug-and-play manner, especially when annotated data is limited. Nevertheless, this increasing trend of model sharing confronts a dilemma. Many researchers and institutes have spent enormous resources on training powerful networks, yet they cannot share their models with the public due to privacy, security or intellectual property issues, even if it is in their interest to contribute to the community.
Existing model-reuse strategies, such as knowledge distillation4-6, transfer learning7-9, and knowledge amalgamation10-12, typically require the pre-trained models to be fully available. Besides, these methods demand substantial on-device computation. Thus, they are inapplicable in this case.
To remedy this issue, a new knowledge amalgamation method termed Federated Selective Aggregation (FedSA) was introduced, the aim of which is to train an on-device student model with the help of several decentralized teachers. We further relaxed the constraint on the specialization of the student: under FedSA, the student model is allowed to tackle a new task differently from any of the teachers. To ensure that the model information is not leaked during training, details of pre-trained models, including their pre-training tasks and data, should be kept private and thus agnostic to other participants. Such a problem setup inevitably imposes enormous challenges to the federated model reusing due to the difficulty of finding helpful teachers for target tasks. Furthermore, as the training domain of different teachers diverges, the student training would have to account for balancing the knowledge aggregated from different teachers.
To this end, a dedicated solution towards knowledge amalgamation was proposed, which aims to customize a student leveraging the pre-training knowledge of private models. Instead of training a student by simply imitating the teachers, the approach in the current work filters and absorbs proper knowledge from the teachers, which is achieved through a saliency-based training schema. Saliency analysis13-15 has been widely used in the literature on network interpretability, and has recently been adopted for revealing the transferability of networks16. Based on the saliency analysis, an adaptive training schema was developed to select useful teachers without accessing the models directly. Specifically, the FedSA approach alternately trains a student model using limited labeled data of the target task and estimates the representation similarity between the student and the private teachers for knowledge selection, as shown in Fig. 1. The selected knowledge is transferred to the student through representation distillation within local nodes, amalgamated via averaging, and then fine-tuned on the target data.
Fig. 1. FedSA aims to train a student model with the help of selective knowledge from pre-trained but private models. The target task and training data can be different from private teachers, allowing flexible model customization in downstream tasks.
Notably, unlike conventional federated learning frameworks that only focus on data privacy, the proposed FedSA approach also takes model privacy into consideration. It enables model knowledge transfer without direct access, which provides a flexible and secure way for model sharing. Besides, the way of distributed training decreases the computation burden of the device since most of the computation is performed on servers with more abundant computation resources. Thus, the proposed FedSA is more suitable to be deployed on the device.
To validate the effectiveness of the proposed approach, extensive experiments were conducted on single-task and multi-task scenarios, respectively, with datasets including CIFAR10, CIFAR10017, ImageNet3218 and Taskonomy19. The results show that the approach successfully amalgamates proper knowledge from teachers to students. Besides, we analyzed the computation and communication costs of FedSA and several state-of-the-art knowledge amalgamation methods10-12. The result indicates that FedSA is more device-friendly.
In summary, the main contribution of the current work is a dedicated FedSA approach for training on-device student models to handle new tasks with the help of decentralized teachers. The proposed approach is endowed with two key advantages. Firstly, the method does not require access to private models or their original training data, yet is able to utilize their knowledge to tackle downstream tasks. Secondly, the approach incurs much lower computation and communication costs. These advantages were validated with various experiments, showing the effectiveness and high scalability of FedSA in both single-task and multi-task settings.

RELATED WORK

Federated learning

Federated learning20-24 develops a decentralized training schema for privacy-preserving learning, enabling multiple clients to learn a network collaboratively without sharing their private data. Note that conventional federated learning frameworks require the models of different clients to have the same architecture25, which hinders model customization for different applications. Li et al.26 proposed FedMD to tackle heterogeneous models through knowledge distillation. In some cases, data from different clients can be non-i.i.d., making conventional federated learning challenging to converge. Li et al.27 introduced FedProx to tackle the heterogeneity in data and systems. Smith et al.28 introduced federated multi-task learning and designed a robust and systems-aware optimization method, termed MOCHA, to handle practical systems issues. Further, federated transfer learning29 is another topic related to this work, proposed to transfer knowledge across domains in federated settings. The current work focused on a more challenging setting, where both model privacy and data privacy were taken into consideration, and all information, including data domains, tasks and model architectures, is agnostic.

Knowledge amalgamation

Knowledge amalgamation focuses on learning a student from multiple teacher models in different domains10-12,30-34. It aims to train a versatile student model that can handle all the tasks of the teachers. For example, Shen et al.10 proposed a layer-wise coding approach to fuse the representations of different tasks and learn a comprehensive classifier from multiple teacher classifiers. Luo et al.11 introduced the idea of common feature learning to extract shared knowledge between teachers. Ye et al.30 proposed a multi-task amalgamation method, which learns a superior student model from teachers to handle various tasks, including semantic segmentation, depth estimation and surface normal estimation. Luo et al.12 introduced a dual-step adaptive competitive-cooperation training approach for learning a multi-talent student model. Knowledge amalgamation has recently been extended to non-Euclidean data such as graphs, where a multi-talent student model was trained for point cloud classification and segmentation34. Zhang et al.33 introduced a new knowledge amalgamation method for transformers, where the student model was trained for object detection. In the current work, a new setting was investigated for knowledge amalgamation, in which the teacher models are not directly available.

Model reusing

As an increasing number of pre-trained models have been released by developers, reusing pre-trained models has become a common practice in the community for solving downstream tasks. The seminal work of Hinton et al.4 established the basic framework of knowledge distillation, which aims to learn a lightweight student model from cumbersome teachers. Following this pioneering framework, many algorithms have been proposed for model reusing. In particular, Ahn et al.8 proposed VKD to transfer the representative ability of a pre-trained model to a student. Romero et al.5 proposed FitNets to transfer the feature extraction ability of the pre-trained model, adopting the feature map as the supervision. Lopes et al.35 proposed data-free knowledge distillation to reuse public models without accessing the original training data. Jiao et al.36 utilized knowledge distillation to compress a transformer model for efficient inference. In the current work, a selective model-reusing problem in federated settings was considered, and FedSA was proposed to train a student model on a different task.

METHODS

Problem setup

Let $\mathcal{M}=\{M_1, M_2, \dots, M_N\}$ denote a set of pre-trained models, trained on the local private datasets $\mathcal{D}=\{D_1, D_2, \dots, D_N\}$. The models and datasets can only be accessed on their own servers. The goal of FedSA is to obtain a model $M_T$ that excels on the target task $T$ with the help of the decentralized pre-trained models. For ease of description, $\bar{\theta}_n$ denotes the parameters of the $n$-th pre-trained model, and $\theta_T$ denotes the parameters of the target model. The annotated data of the target task $T$ is $D_T=\{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}$. Owing to expensive labeling costs, the amount of annotated data is usually limited. Besides, the method is performed on resource-limited devices.

Federated knowledge amalgamation

Overview

A novel framework named Federated Selective Aggregation (FedSA) was proposed; its overall structure is shown in Fig. 2. With the saliency-based method, the proper models $\hat{S}$ are chosen for further selective aggregation. Owing to the progressive learning of the target model, the process of knowledge selection and selective aggregation is iterative. After several rounds of selective aggregation, the target model is adapted to the target data. Through target task adaptation, the target model learns task-related knowledge and achieves competitive performance.
Fig. 2. Overview of the FedSA framework. It comprises three parts: saliency-based knowledge selection, selective aggregation, and target task adaptation. Knowledge selection picks proper models with saliency maps, selective aggregation distills the knowledge from the pre-trained models to local students, and task adaptation strengthens the performance of the target model.

Saliency-based knowledge selection

With massive variation in knowledge among models, a proper model is more beneficial to the target model, so it is essential to determine a way to select the proper models. Knowledge about the target task is a prerequisite for selection, so the target model is first trained with the limited target data.
Then the saliency-based method is adopted to choose the proper models. The saliency map is widely used to explain the attention of models and reflects the knowledge of models, so it is adopted by the knowledge selection to measure knowledge transferability. In detail, the representation of each model is calculated by averaging the saliency maps over the target data, and the transferability scores are the pair-wise similarities of the models' representations.
Let $a_{n,j}^{k} \in \mathbb{R}^{W \times H \times C}$ denote the saliency map of the $k$-th layer of model $M_n$ for one input image $x_j$ in the probe data $D_T$. It can be computed through one single forward and backward propagation,
$$a_{n,j}^{k} = \left[\,\left|\frac{\partial r^{k}}{\partial x_j}\right|\,\right]_{W_n \times H_n \times C_n},$$
where $r^{k}$ is the network's hidden representation at the $k$-th layer. Let $A_n^{k}$ denote the overall saliency map for the $k$-th layer of the $n$-th model $M_n$. It is formed by averaging all the saliency maps $\{a_{n,j}^{k}\}_{j=1}^{M}$; formally, $A_n^{k} = \frac{1}{M}\sum_{j=1}^{M} a_{n,j}^{k}$. After measuring the $N$ attribution maps for the models in $\mathcal{M}$, we get $\mathcal{A} = \{A_1^{k}, A_2^{k}, \dots, A_N^{k}\}$. The transferability score of two models can be estimated as follows:
$$Q_{n,T} = \frac{\mathrm{cos\_sim}(A_n^{k}, A_T^{k})}{M},$$
where $\mathrm{cos\_sim}(A_n^{k}, A_T^{k}) = \frac{A_n^{k} \cdot A_T^{k}}{\|A_n^{k}\|\,\|A_T^{k}\|}$ and $M$ is the number of target data samples. The selection probabilities are then calculated by normalizing the transferability scores between the target model $M_T$ and the candidate models $\{M_n\}_{n=1}^{N}$:
$$p_n = \frac{e^{Q_{n,T}}}{\sum_{b=1}^{N} e^{Q_{b,T}}}.$$
The set of selected model indexes $\hat{S}$ is established by sampling with replacement according to the probabilities $\{p_1, p_2, \dots, p_N\}$. The theoretical analysis in the supplementary materials indicates that this sampling-with-replacement policy guarantees the convergence of the algorithm.
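The knowledge-selection step above can be sketched as follows. This is a minimal PyTorch illustration, not the released implementation: the saliency probe uses the norm of the hidden representation as the scalar for the backward pass, and the $1/M$ factor in the score is folded into a softmax temperature `tau`; both are our simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def saliency_map(model, layer, x):
    """|d r^k / d x|: gradient of a hidden layer's activation w.r.t. the
    input, obtained with a single forward and backward pass."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    x = x.clone().requires_grad_(True)
    model(x)
    handle.remove()
    feats["out"].norm().backward()  # any scalar probe of r^k works here
    return x.grad.abs()             # same shape as the input batch

def selection_probs(teacher_maps, target_map, tau=1.0):
    """Cosine similarity between averaged saliency maps, softmax-normalized
    into selection probabilities p_n."""
    t = target_map.flatten()
    scores = torch.stack([F.cosine_similarity(a.flatten(), t, dim=0)
                          for a in teacher_maps])
    return F.softmax(scores / tau, dim=0)
```

Teachers would then be drawn with replacement, e.g. via `torch.multinomial(probs, k, replacement=True)`.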

Selective aggregation

After the knowledge selection, the selected model set $\hat{S}$ has been established. The selective aggregation distills the knowledge from the selected decentralized models to the target model. Due to privacy issues, the models and data may only be allowed for internal usage.
In FedSA, a local student model is employed as the messenger of communication between the target and pre-trained models. It learns the knowledge from the pre-trained models and is uploaded to the central device for further aggregation. Most neural networks can be divided into an encoder and a decoder, and the output features of the encoder contain rich semantic information; moreover, the feature extraction ability of the teacher model is more generalizable. Therefore, the local student learns the knowledge from the local teacher model through feature-based knowledge distillation5,37.
Due to heterogeneous network architectures, the output feature dimensions of the teacher and student models can be distinct. A translator block formed by three convolutions with 1×1 kernels is used to align the teacher and student output dimensions: the student output is converted through the block to a predefined output length. Let $F_n^{s}$ and $F_n^{p}$ denote the aligned features of the student model and the teacher model $M_n$, respectively. $X$ denotes the training data of model $M_n$, and $\theta_n^{t}$ denotes the parameters of the $n$-th local student at the $t$-th iteration. In order to encourage the intermediate output of the student model to imitate that of the teacher model, the feature knowledge distillation loss is computed as follows:
$$\mathcal{L}_{KD} = \frac{1}{2}\left\|F_n^{p}(X;\bar{\theta}_n) - F_n^{s}(X;\theta_n^{t})\right\|^{2}.$$
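A minimal sketch of the translator block and the distillation loss follows. The hidden channel width of the translator and the use of a batch mean in the loss are our illustrative choices, not details taken from the paper:

```python
import torch
import torch.nn as nn

class Translator(nn.Module):
    """Three 1x1 convolutions that project a student feature map to the
    predefined channel dimension of the teacher's feature map."""
    def __init__(self, in_ch, out_ch, hidden=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, kernel_size=1))

    def forward(self, f):
        return self.proj(f)

def kd_loss(f_teacher, f_student_aligned):
    """L_KD = 1/2 * ||F^p - F^s||^2 (here averaged over batch elements)."""
    return 0.5 * (f_teacher - f_student_aligned).pow(2).mean()
```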
After the local knowledge distillation, the parameters of local students θnt were uploaded to the center server.
To aggregate the knowledge of the different local student models, the target model's parameters are updated with the $K$ selected teachers as follows:
$$\theta^{t+1} = \frac{1}{K}\sum_{n \in \hat{S}} \theta_n^{t}.$$
Direct access to the pre-trained models is avoided through the local student models, ensuring both model and data privacy. Besides, most of the calculations are performed on the servers, and the aggregation on the central device is designed to be very efficient, so it can easily be performed on device.

Task adaptation

In the selective aggregation, the target model $M_T$ only learns knowledge about generalizable feature extraction, so it is natural to perform target task adaptation for better performance. In detail, the decoder part of the target model is trained with the annotated target data. As the annotated target data is usually limited, this step can be performed efficiently on device.
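The task-adaptation step can be sketched as follows. This is our minimal illustration: it assumes the target model exposes `encoder` and `decoder` submodules (matching the encoder/decoder split described above), and the choice of the Adam optimizer is ours.

```python
import torch
import torch.nn as nn

def adapt_to_task(model, target_loader, loss_fn, epochs=1, lr=1e-3):
    """Task adaptation sketch: freeze the (aggregated) encoder and fine-tune
    only the decoder on the limited annotated target data."""
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(model.decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in target_loader:
            opt.zero_grad()
            loss = loss_fn(model.decoder(model.encoder(x)), y)
            loss.backward()
            opt.step()
    return model
```

Because only the small decoder receives gradient updates, this step stays cheap enough for on-device execution.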

Algorithm summary

The proposed method is summarized in Algorithm 1, which includes three stages: 1) saliency-based knowledge selection; 2) selective aggregation; 3) task adaptation. The overall objective function of FedSA is as follows:
$$\min_{\theta_T}\ \sum_{i=1}^{M} \mathcal{L}_{task}\left(y_i,\, M_T(x_i;\theta_T)\right), \qquad \theta_T^{0} \leftarrow \frac{1}{K}\sum_{n \in \hat{S}} \theta_n^{t},$$
where $\mathcal{L}_{task}$ refers to the loss function of the target task and the target parameters $\theta_T$ are initialized from the aggregated local-student parameters $\theta_T^{0}$.

EXPERIMENTS

In this section, the proposed FedSA method was validated and its performance evaluated via two sets of quantitative experiments: single-task amalgamation and multi-task amalgamation. In single-task amalgamation, the experiments were conducted on both homogeneous and heterogeneous networks. Furthermore, ablation, analytical and visualization experiments were conducted to validate the effectiveness of FedSA. We also analyzed the computation and communication costs required by the proposed method and several current state-of-the-art (SOTA) knowledge amalgamation methods. More experimental details can be found in the supplementary material.

Experimental settings

Datasets and models

The single-task amalgamation experiments were conducted on three public benchmark datasets, i.e., CIFAR10, CIFAR10017 and ImageNet3218. Among them, CIFAR10 and CIFAR100 were used as the source datasets to construct the target or probe dataset, and ImageNet32 was employed to construct the local private datasets. More specifically, to compose the target dataset, {1%, 5%, 10%, 100%} of the images were randomly extracted from each class of the source dataset CIFAR10 or CIFAR100. Since there are ten local teachers, ImageNet32 was divided into ten subsets of 100 classes each without overlap, and these sub-datasets were used as the local private datasets. ResNet3438 was adopted as the network structure of the local private teachers, and these local teacher models were pre-trained on their corresponding private datasets. ResNet34, ResNet1838, VGG1939, WRN-16-240 and MobileNetV241 were used as the network structures of the local student models and the target model in the homogeneous and heterogeneous experiment settings. Experiments for multi-task amalgamation were performed on the Taskonomy dataset19, which includes over 4 million images of indoor scenes from around 600 buildings; each image has 26 annotations for various visual tasks. Moreover, Taskonomy also provides 26 pre-trained models focusing on different visual tasks, whose network structures can be unified into two parts, namely, the encoder and the decoder. A visual task was selected from the task dictionary as the target task, and the remaining 25 off-the-shelf ResNet50 encoders with different decoders were used as local teacher models to help train the target model for the chosen task.
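The disjoint 100-class split of ImageNet32 can be reproduced with a few lines. The sketch below reflects our reading of the setup, and the random seed is an arbitrary choice:

```python
import random

def split_classes(num_classes=1000, num_teachers=10, seed=0):
    """Partition class labels into disjoint, equally sized subsets,
    one subset per local teacher (e.g. 10 teachers x 100 classes)."""
    labels = list(range(num_classes))
    random.Random(seed).shuffle(labels)
    step = num_classes // num_teachers
    return [set(labels[i * step:(i + 1) * step]) for i in range(num_teachers)]
```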

Implementation details

All the methods were implemented in PyTorch and run on a Quadro P6000 GPU. Single-task amalgamation experiments focused on classification tasks, where the pre-trained local private teacher models and the target student model were responsible for different classes. As for multi-task amalgamation, the target student model was expected to solve a new visual task: a task from the task dictionary was chosen as the target task, and the pre-trained models of the remaining tasks were used as the local teacher models. For a fair comparison, all compared methods utilized the knowledge of the pre-trained models used in FedSA, and experiment settings such as the network architecture and target data were kept the same. Hyperparameter settings were unified across the experiments according to their impact on the algorithm's results. For more details, refer to the supplementary information.

Single-task amalgamation

Single-task amalgamation focuses on the classification task. Different local tasks refer to non-overlapping sets of classes, and the classification ability of the local teacher models is transferred to help classify the new categories in the target task. To show the effectiveness of the proposed method, the following experiments were conducted.

Baselines

Eight methods, Scratch Training, Transfer, KACC10, KACFL11, SOKA-Net12, FedAvg+20, FedMD+26 and FedProx+27, were adopted as baselines to validate the effectiveness of the proposed method. For Scratch Training, the target dataset was used to train the models directly. The Transfer method adopted the target dataset to fine-tune models pre-trained on subsets of ImageNet32 (the local private datasets). As no existing method directly addresses the problem, three federated methods, FedAvg+, FedMD+ and FedProx+, were modified for a fair comparison. In order to keep the same privacy implications, the feature-based knowledge distillation step was also included in the local update phase of FedAvg+, FedMD+ and FedProx+, but these methods have no knowledge selection phase. After the iterative federated learning phase, the models from the three federated methods were fine-tuned on the downstream target data. Three knowledge amalgamation methods, KACC10, KACFL11 and SOKA-Net12, were also selected to further validate the performance of FedSA; the data privacy constraints were relaxed for these knowledge amalgamation methods.

Results on single-task amalgamation

Table 1 shows the results of the proposed FedSA and the baselines; FedSA achieved the best results in all the experimental settings. The Transfer method achieved better results than Scratch Training, owing to the knowledge in the pre-trained models. Compared with the modified federated learning methods, the results are improved due to the effectiveness of collaboration, but differences still exist. In addition, several state-of-the-art knowledge amalgamation methods are inferior to FedSA. To further examine model privacy protection, extensive experiments on heterogeneous networks were conducted, in which different nodes could exhibit divergent model architectures. FedSA also achieved the best performance there.
Table 1. The performance of FedSA and baseline methods on CIFAR10 and CIFAR100. Symbol + indicates that the method is modified for our problem settings.
Dataset Method Homogeneous Heterogeneous (10%)
1% 5% 10% 100% Wrn Res18 Vgg Mob
CIFAR-10 Scratch Training 40.00 62.14 75.43 91.56 73.97 74.91 73.63 66.23
Transfer 63.70 79.38 83.34 92.88 - - - -
KACC10 64.52 81.43 84.52 93.43 83.96 84.60 75.77 80.44
KACFL11 37.66 53.37 62.09 88.73 63.16 64.56 59.96 34.61
SOKA-Net12 72.84 79.16 80.65 91.58 80.74 83.01 81.44 81.28
FedAvg+20 69.96 71.58 78.04 89.24 76.05 71.69 80.97 81.87
FedMD+26 71.25 83.20 85.52 92.37 81.28 84.23 80.76 71.17
FedProx+27 36.51 55.88 63.05 87.07 46.11 65.11 61.17 49.47
FedSA 77.15 85.30 87.57 93.83 84.70 86.39 85.71 82.04
CIFAR-100 Scratch Training 9.24 21.00 29.25 66.75 28.99 29.82 19.78 26.05
Transfer 18.43 42.54 48.45 69.27 - - - -
KACC10 19.28 39.41 47.61 71.86 47.05 46.95 26.85 40.98
KACFL11 6.24 12.67 28.79 46.62 18.79 21.58 13.43 20.41
SOKA-Net12 26.69 49.79 54.63 69.75 48.62 53.72 51.92 46.09
FedAvg+20 20.93 29.35 36.08 62.91 42.76 34.46 49.85 44.32
FedMD+26 28.67 44.73 50.31 70.28 42.30 49.09 42.38 30.35
FedProx+27 4.82 15.72 21.86 52.67 16.29 22.36 11.46 13.88
FedSA 32.99 52.76 57.87 73.49 50.49 58.32 53.29 47.25

Ablation study on aggregation strategy

In order to investigate the effectiveness of the selection and aggregation strategy, several methods with different knowledge selection and aggregation strategies were implemented for comparison. Apart from the proposed FedSA, four knowledge selection strategies and three knowledge aggregation schemes were considered. The knowledge selection strategies are as follows: Random selects models with equal probability; Fixed selects the models with the highest transferability scores and keeps the selected set immutable; Top-K selects the k models with the highest transferability scores, while Least-K selects those with the lowest scores. The three aggregation schemes are: weighting positively correlated with the transferability scores (referred to as Positive Weighted), weighting negatively correlated with the similarity (referred to as Negative Weighted), and Unweighted.
Table 2 shows the results of the methods with the different knowledge selection and amalgamation strategies. Overall, the performance of FedSA, with its high accuracy and stability, is quite competitive in all settings. Fixed obtains the worst results due to overfitting. Random sometimes exhibits good performance but is not suitable for practical applications since it is unstable. The Least-K selection method is less competitive than Top-K and the FedSA selection method because of the lower similarity of the selected saliency maps. However, as the target model is updated, the models selected with Positive Weighted tend to become fixed, which may also cause overfitting and is detrimental to improving the effectiveness of the global model. As a result, FedSA, with its stability and flexibility, can select models with high similarity and avoid the problem of overfitting to some degree.
Table 2. The performance of different amalgamation strategies. Accuracy of student models amalgamated from 10 homogeneous and heterogeneous ImageNet32 teachers.
Dataset Strategy Homogeneous Heterogeneous (10%)
1% 5% 10% 100% Wrn Res Vgg Mob
CIFAR-10 Fixed + Unweighted 73.46 82.22 85.28 93.11 83.89 84.00 83.78 80.49
Top-K + Unweighted 75.14 84.15 85.55 93.28 83.23 85.76 84.07 81.76
Top-K + Positive Weighted 74.65 79.89 87.38 93.16 83.69 84.70 83.55 81.19
Top-K + Negative Weighted 76.54 83.52 85.87 93.51 83.36 85.29 84.26 81.50
Least-K + Positive Weighted 75.50 84.44 85.16 93.21 82.78 85.72 83.85 81.14
Random + Positive Weighted 75.81 84.15 86.45 93.65 84.17 86.10 85.23 81.30
FedSA 77.15 85.30 87.57 93.83 84.70 86.39 85.71 82.04
CIFAR-100 Fixed + Unweighted 28.42 48.72 54.05 71.38 49.08 54.97 48.71 44.21
Top-K + Unweighted 32.78 49.10 55.01 71.99 50.37 55.18 50.04 44.33
Top-K + Positive Weighted 32.80 51.22 54.50 71.67 50.24 54.42 49.88 45.37
Top-K + Negative Weighted 32.69 50.22 56.13 72.44 49.39 55.34 51.16 45.24
Least-K + Positive Weighted 32.74 50.95 55.61 71.85 49.76 55.74 49.57 45.88
Random + Positive Weighted 32.46 52.39 56.77 72.97 50.03 57.76 51.98 46.96
FedSA 32.99 52.76 57.87 73.49 50.49 58.32 53.29 47.25

Visualization of dynamic knowledge selection

In this part, the transferability scores at different training stages on CIFAR10 and CIFAR100 are visualized in Fig. 3. As the knowledge of the target model increases, the most proper models for the target model change, especially from the first phase of knowledge selection to the second phase. Therefore, the knowledge selection needs to be dynamic, which is more in line with the characteristics of progressive learning.
Fig. 3. a, Visualization of Dynamic Knowledge Selection for CIFAR10 and CIFAR100. The darker color indicates the higher transferability scores. b-c, The performance of FedSA with different number of nodes selected.
Analytical experiments

Several experiments were conducted with different numbers of nodes selected from the clients to explore the factors affecting the results of the proposed method. The experiment settings of this section are the same as those of the single-task experiments, and the number of chosen nodes ranges from one to five. The results are shown in Fig. 3: fewer nodes may lead to limited knowledge transfer, while more nodes may result in off-task knowledge transfer and unpredictable results.
The effectiveness of transferability scores

Experiments were conducted to demonstrate the effectiveness of the transferability scores. 10% of the CIFAR data were used as the target dataset, and the teacher models were pre-trained on the local private datasets. Transferability scores were calculated between the target model and each local teacher model and then normalized to [0, 1] for contrast. All parameters of the local teacher models, except the last layer, were frozen, and the teacher models were trained to evaluate the transfer accuracy. Table 3 shows that the transfer accuracy and the transferability score generally correlate positively.
Table 3. Relation between transfer accuracy and saliency similarity. The similarity is normalized to [0,1] for contrast. High saliency typically reveals better transfer performance, which validates the effectiveness of saliency-based knowledge selection.
Dataset Teacher ID #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
CIFAR-10 Transfer Acc. (%) 62.13 61.38 63.73 66.51 63.28 65.48 62.41 62.97 63.93 61.99
Transferability Score 0.35 0.61 0.92 1.0 0.11 0.61 0.0 0.26 0.53 0.19
CIFAR-100 Transfer Acc. (%) 31.76 33.17 30.19 39.04 41.04 42.17 41.53 41.82 43.89 39.27
Transferability Score 0.43 0.23 0.0 0.25 0.42 0.74 0.48 0.42 1.0 0.77

Multi-task amalgamation

Results on taskonomy

In order to show the effectiveness of the proposed method in cross-task settings, experiments were conducted on the Taskonomy dataset19. Taskonomy provides 26 types of annotations for different visual tasks, together with the corresponding pre-trained models. In this section, FedSA was deployed to amalgamate knowledge from different models to resolve a different task.
The Taskonomy dataset is divided into training, validation and test sets, which were adopted as the local private, target and test datasets, respectively. In the experiment, one of the tasks in Taskonomy was selected as the target task, while the others were regarded as the tasks of the pre-trained teacher models. Semantic segmentation and object classification were selected as the target tasks to validate the effectiveness of FedSA. Although the proposed FedSA was only verified on object classification and semantic segmentation, the method can be easily adapted to other visual tasks. As shown in Table 4, the results indicate that the proposed FedSA is quite competitive compared with the others, especially on the semantic segmentation task. The results of Scratch Training are unsatisfactory owing to the limited labeled data. As for the Transfer method, the result was improved effectively with the prior knowledge from the pre-trained models. Owing to the redundancy of knowledge, FedAvg+ does not exhibit excellent performance. The KACC method amalgamates multiple pre-trained models so that the target model can make accurate predictions. In the proposed FedSA, the appropriate knowledge from different models is amalgamated into the target model with task adaptation. Therefore, more accurate results were achieved by FedSA.
Table 4. Performance of target model amalgamated from Taskonomy models. The classification and semantic segmentation are used as the target task for comparison.
Task / Metric Method Performance
Cls. / Acc. (%) Scratch Training 31.47
Transfer 35.68
KACC10 37.48
FedAvg+20 31.94
FedSA 51.58
Seg. / mIoU Scratch Training 0.07
Transfer 0.21
KACC10 0.18
FedAvg+20 0.14
FedSA 0.41

Visualization of semantic segmentation

In this section, three images from the Taskonomy test set were chosen as inputs to the target model, and the semantic segmentation predictions of the target models are visualized in Fig. 4. The figure shows the effect of the proposed FedSA more intuitively: FedSA correctly predicts some small objects that the other methods miss or mispredict. The visualization results indicate that the proposed FedSA yields a more accurate target model than the other methods.
Fig. 4. The visualized semantic segmentation results of the proposed FedSA and several baselines. The Transfer method requires the pre-trained models to be accessible.

Computation and communication cost analysis

The computation resources and network bandwidth of a device are usually limited. This part compares the computation and communication costs of several SOTA knowledge amalgamation methods and FedSA. FLOPs (floating-point operations) measure the amount of computation and thus the complexity of an algorithm or model; we compare and analyze FLOPs here to show the feasibility of the proposed FedSA for practical deployment. C.B. denotes the total number of bytes that need to be communicated, including both device-to-server and server-to-device communication.
For a simple comparison, ResNet18 was taken as the network architecture of the target model and CIFAR10 as the target dataset. It is assumed that the target dataset contains 5,000 samples (10% of the whole CIFAR10 dataset), the number of selected models is 3, the number of communication rounds is 50, and fine-tuning runs for 30 epochs. The pre-trained models' data comprise 383,450 samples (30% of the ImageNet32 dataset). The backward-to-forward FLOP ratio in neural networks is typically between 1:1 and 3:1 and most often close to 2:142; we take 2:1 in the current work.
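As a sanity check on the arithmetic used throughout this comparison, the fine-tuning cost shared by all methods can be reproduced in a few lines of Python. The per-sample figures (556.71M FLOPs for a forward pass of the frozen ResNet18 backbone, 5.12K FLOPs for the trainable linear head) are taken from the analysis as given; the script only verifies the stated total.

```python
# Verifies the fine-tuning cost shared by every method in this comparison.
# The frozen backbone contributes its forward pass only; the linear head is
# trained with a backward:forward FLOP ratio of 1:1, hence the factor of 2.
M, K, T = 1e6, 1e3, 1e12

n_target = 5_000   # 10% of CIFAR10
epochs   = 30

finetune_flops = (556.71 * M + 5.12 * K * 2) * n_target * epochs
print(round(finetune_flops / T, 2))  # -> 83.51 (TFLOPs)
```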

KACC

The amalgamation procedure of KACC follows two steps: feature amalgamation and parameter learning. Thus, we get 7.39M × 383,450 × 50 = 141.68T FLOPs for feature amalgamation and (556.71M × 3 + 7.39M × 2) × 383,450 × 50 = 32303.94T FLOPs for parameter learning. The fine-tuning of the target model costs (556.71M + 5.12K × 2) × 5,000 × 30 = 83.51T FLOPs, as the backward-to-forward FLOP ratio of a linear layer is 1:1. The intermediate feature maps need to be communicated between the device and the servers, so the total number of communicated bytes is 8K × 3 × 383,450 × 50 = 438.82G.
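The KACC figures can be checked numerically. This is only a sanity check of the stated arithmetic; the per-sample FLOP counts (7.39M for the amalgamation module, 556.71M per teacher) and the 8 KiB feature-map size are taken from the text as given.

```python
# Sanity check of the KACC cost arithmetic (all per-sample figures from the text).
M, T = 1e6, 1e12
GiB = 1024 ** 3

passes = 383_450 * 50                                    # samples x rounds

feat_amalg  = 7.39 * M * passes                          # feature amalgamation
param_learn = (556.71 * M * 3 + 7.39 * M * 2) * passes   # 3 teachers + student
comm_bytes  = 8 * 1024 * 3 * passes                      # 8 KiB maps, 3 teachers

print(round(feat_amalg / T, 2))    # -> 141.68
print(round(param_learn / T, 2))   # -> 32303.94
print(round(comm_bytes / GiB, 2))  # -> 438.82
```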

KACFL

KACFL employs individual adaptation layers and a shared extractor in its feature amalgamation. Therefore, we get (5.54M + 1.39M) × 4 × 383,450 × 50 = 531.46T FLOPs for feature amalgamation and (556.71M × 3 + 5.54M × 4 × 2 + 1.39M × 4 × 2) × 383,450 × 50 = 33083.49T FLOPs for parameter learning. The fine-tuning of the target model costs (556.71M + 5.12K × 2) × 5,000 × 30 = 83.51T FLOPs. The output feature maps of the pre-trained teachers need to be communicated in KACFL, so the total number of communicated bytes is 8K × 3 × 383,450 × 50 = 438.82G.

SOKA-Net

In SOKA-Net, the features are amalgamated with the MIA module. We get 4.19M × 3 × 383,450 × 50 = 240.99T FLOPs for feature amalgamation and (556.71M × 3 + 4.19M × 3 × 2) × 383,450 × 50 = 32502.56T FLOPs for parameter learning of the target model. The fine-tuning of the target model costs (556.71M + 5.12K × 2) × 5,000 × 30 = 83.51T FLOPs. The output feature maps of the pre-trained teachers also need to be communicated in SOKA-Net, so the total number of communicated bytes is 8K × 3 × 383,450 × 50 = 438.82G.

FedSA

FedSA comprises two main steps in its knowledge amalgamation procedure: aggregating the knowledge from the selected clients and fine-tuning the last linear layer of the target model on the target dataset. Thus, we get 11.17M × 3 × 50 = 1675.5M FLOPs for knowledge aggregation, as only the parameters of the agent models need averaging, and (556.71M + 5.12K × 2) × 5,000 × 30 = 83.51T FLOPs for fine-tuning. Besides, only the parameters of the target models are delivered, so the total number of communicated bytes is 11.17M × 3 × 2 × 50 = 3.43G.
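To close the loop, the FedSA-side arithmetic and the headline savings ratios can be reproduced the same way; the KACC totals used for the ratios (32529.13T FLOPs and 438.82 GB) are those listed in Table 5, taken as given.

```python
# FedSA cost arithmetic and the savings ratios relative to KACC.
# 11.17M is the per-agent parameter figure from the text.
M, K, T = 1e6, 1e3, 1e12

aggregation = 11.17 * M * 3 * 50                       # average 3 agents, 50 rounds
finetune    = (556.71 * M + 5.12 * K * 2) * 5_000 * 30

print(round(aggregation / M, 1))  # -> 1675.5 (MFLOPs)
print(round(finetune / T, 2))     # -> 83.51 (TFLOPs)
print(round(32529.13 / 83.51))    # -> 390 (computation saving vs. KACC)
print(round(438.82 / 3.43))       # -> 128 (communication saving vs. KACC)
```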
The computation and communication cost analysis for the proposed method, compared with several current SOTA knowledge amalgamation methods, is summarized in Table 5. The framework proposed in the current work exhibits far lower computation (about 390×) and communication (about 128×) costs than the other knowledge amalgamation methods, as a target student can be obtained without any assistance from the teachers' training data, demonstrating the feasibility of FedSA for practical on-device deployment.
Table 5. The computation and communication costs of several SOTA knowledge amalgamation methods and FedSA. C.B. means the total number of bytes needing to be communicated.
Method FLOPs (T) C.B. (GB)
KACC10 32529.13 438.82
KACFL11 33698.46 438.82
SOKA-Net12 32827.06 438.82
FedSA 83.51 3.43

CONCLUSIONS

In the current work, a new knowledge amalgamation problem was investigated, the goal of which is to train a target model for a new task with the help of several decentralized pre-trained models. To this end, a new model-reuse scheme termed FedSA was proposed. Specifically, an efficient saliency-based approach was devised to estimate the importance of the different teachers and to selectively aggregate the proper knowledge from the pre-trained models into the target model, after which the target model was adapted to the target task. Extensive experiments demonstrated that FedSA outperforms the baseline methods across various datasets. Moreover, the proposed FedSA reduces both computation and communication costs, making it more suitable for on-device deployment than other knowledge amalgamation methods.

MISCELLANEA

Supplementary material Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.chip.2023.100053.
Acknowledgements This work is supported by National Natural Science Foundation of China (61976186, U20B2066), and the Fundamental Research Funds for the Central Universities (2021FZZX001-23, 226-2023-00048).
Declaration of Competing Interest The authors declare no competing interests.
1.
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84-90 (2017). https://doi.org/10.1145/3065386.

2.
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805(2018).

3.
Amodei, D. et al. Deep speech 2: end-to-end speech recognition in English and mandarin. Preprint at https://doi.org/10.48550/arXiv.1512.02595(2015).

4.
Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://doi.org/10.48550/arXiv.1503.02531(2015).

5.
Romero, A., et al. FitNets: hints for thin deep nets. Preprint at https://doi.org/10.48550/arXiv.1412.6550(2014).

6.
Zagoruyko, S. & Komodakis, N. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. Preprint at https://doi.org/10.48550/arXiv.1612.03928(2016).

7.
Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? In 2014 the 27th International Conference on Neural Information Processing Systems (NIPS), 3320-3328 (2014).

8.
Ahn, S., Hu, S. X., Damianou, A., Lawrence, N. D. & Dai, Z. Variational information distillation for knowledge transfer. Preprint at https://doi.org/10.48550/arXiv.1904.05835(2019).

9.
Zhuang, F. et al. A comprehensive survey on transfer learning. Proc. IEEE 109, 43-76 (2021). https://doi.org/10.1109/JPROC.2020.3004555.

10.
Shen, C., Wang, X., Song, J., Sun, L. & Song, M. Amalgamating knowledge towards comprehensive classification. In 2019 Proceedings of the AAAI Conference on Artificial Intelligence, 3068-3075 (AAAI, 2019). https://doi.org/10.1609/aaai.v33i01.33013068.

11.
Luo, S. et al. Knowledge amalgamation from heterogeneous networks by common feature learning. Preprint at https://doi.org/10.48550/arXiv.1906.10546(2019).

12.
Luo, S. et al. Collaboration by competition: self-coordinated knowledge amalgamation for multi-talent student learning. In 2020 European Conference on Computer Vision (ECCV), 631-646 (Springer, 2020). https://doi.org/10.1007/978-3-030-58539-6_38.

13.
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 618-626 (IEEE, 2017). https://doi.org/10.1109/ICCV.2017.74.

14.
Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 839-847 (IEEE, 2018). https://doi.org/10.1109/WACV.2018.00097.

15.
Wang, H. et al. Score-CAM: score-weighted visual explanations for convolutional neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 24-25. (IEEE, 2020). https://doi.org/10.1109/CVPRW50498.2020.00020.

16.
Song, J., Chen, Y., Wang, X., Shen, C. & Song, M. Deep model transferability from attribution maps. Preprint at https://doi.org/10.48550/arXiv.1909.11902(2019).

17.
Krizhevsky, A. Learning multiple layers of features from tiny images. (University of Toronto, 2009).

18.
Chrabaszcz, P., Loshchilov, I. & Hutter, F. A downsampled variant of imagenet as an alternative to the CIFAR datasets. Preprint at https://doi.org/10.48550/arXiv.1707.08819(2017).

19.
Zamir, A. R., et al. Taskonomy: disentangling task transfer learning. Preprint at https://doi.org/10.48550/arXiv.1804.08328(2018).

20.
McMahan, H. B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. Preprint at https://doi.org/10.48550/arXiv.1602.05629(2016).

21.
Kairouz, P. et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 14, 141-210 (2021). https://doi.org/10.1561/2200000083.

22.
Augenstein, S., et al. Generative models for effective ML on private, decentralized datasets. Preprint at https://doi.org/10.48550/arXiv.1911.06679(2019).

23.
Konečný, J., et al. Federated learning: strategies for improving communication efficiency. Preprint at https://doi.org/10.48550/arXiv.1610.05492(2016).

24.
Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10, 1-19 (2019). https://doi.org/10.1145/3298981.

25.
Li, Q. et al. A survey on federated learning systems: vision, hype and reality for data privacy and protection. IEEE Trans. Knowl. Data Eng. 35, 3347-3366 (2023). https://doi.org/10.1109/TKDE.2021.3124599.

26.
Li, D. & Wang, J. FedMD: heterogenous federated learning via model distillation. Preprint at https://doi.org/10.48550/arXiv.1910.03581(2019).

27.
Li, T., et al. Federated optimization in heterogeneous networks. Preprint at https://doi.org/10.48550/arXiv.1812.06127(2018).

28.
Smith, V., Chiang, C.-K., Sanjabi, M. & Talwalkar, A. Federated multi-task learning. Preprint at https://doi.org/10.48550/arXiv.1705.10467(2017).

29.
Saha, S. & Tahir, A. Federated transfer learning: concept and applications. Intell. Artif. 15, 35-44 (2021). https://doi.org/10.3233/IA-200075.

30.
Ye, J., et al. Student becoming the master: knowledge amalgamation for joint scene parsing, depth estimation, and more. Preprint at https://doi.org/10.48550/arXiv.1904.10167(2019).

31.
Thadajarassiri, J., Hartvigsen, T., Kong, X. & Rundensteiner, E. Semi-supervised knowledge amalgamation for sequence classification. In 2021 Proceedings of the AAAI Conference on Artificial Intelligence, 9859-9867 (AAAI, 2021). https://doi.org/10.1609/aaai.v35i11.17185.

32.
Li, C., Xu, W., Si, X. & Song, P. KABI: Class-incremental learning via knowledge amalgamation and batch identification. In 2021 the 5th International Conference on Innovation in Artificial Intelligence (ICIAI), 170-176 (2021). https://doi.org/10.1145/3461353.3461367.

33.
Zhang, H. et al. Knowledge amalgamation for object detection with transformers. IEEE Trans. Image Process. 32, 2093-2106 (2023). https://doi.org/10.1109/TIP.2023.3263105.

34.
Jing, Y., Yang, Y., Wang, X., Song, M. & Tao, D. Amalgamating knowledge from heterogeneous graph neural networks. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15704-15713 (IEEE, 2021). https://doi.org/10.1109/CVPR46437.2021.01545.

35.
Lopes, R. G., Fenu, S. & Starner, T. Data-free knowledge distillation for deep neural networks. Preprint at https://doi.org/10.48550/arXiv.1710.07535(2017).

36.
Jiao, X., et al. TinyBERT: Distilling BERT for natural language understanding. Preprint at https://doi.org/10.48550/arXiv.1909.10351(2019).

37.
Park, S. & Kwak, N. Feature-level ensemble knowledge distillation for aggregating knowledge from multiple networks. (IOS Press, 2020). https://doi.org/10.3233/FAIA200246.

38.
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Preprint at https://doi.org/10.48550/arXiv.1512.03385(2015).

39.
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://doi.org/10.48550/arXiv.1409.1556(2014).

40.
Zagoruyko, S. & Komodakis, N. Wide residual networks. Preprint at https://doi.org/10.48550/arXiv.1605.07146(2016).

41.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. Preprint at https://doi.org/10.48550/arXiv.1801.04381(2018).

42.
What's the backward-forward FLOP ratio for neural networks? Epoch AI, epochai.org. Accessed December 19, 2022.
