Abstract
In Machine Learning scenarios, privacy is a crucial concern when models have to be trained with private data coming from users of a service, such as a recommender system, a location-based mobile service, a mobile phone text messaging service providing next word prediction, or a face image classification system. The main issue is that, often, data are collected, transferred, and processed by third parties. These transactions violate new regulations, such as GDPR. Furthermore, users usually are not willing to share private data such as their visited locations, the text messages they wrote, or the photo they took with a third party. On the other hand, users appreciate services that work based on their behaviors and preferences. In order to address these issues, Federated Learning (FL) has been recently proposed as a means to build ML models based on private datasets distributed over a large number of clients, while preventing data leakage. A federation of users is asked to train a same global model on their private data, while a central coordinating server receives locally computed updates by clients and aggregate them to obtain a better global model, without the need to use clients’ actual data. In this work, we extend the FL approach by pushing forward the state-of-the-art approaches in the aggregation step of FL, which we deem crucial for building a high-quality global model. Specifically, we propose an approach that takes into account a suite of client-specific criteria that constitute the basis for assigning a score to each client based on a priority of criteria defined by the service provider. Extensive experiments on two publicly available datasets indicate the merits of the proposed approach compared to standard FL baseline.
Introduction
The vast amount of data generated by billions of mobile and online IoT devices worldwide holds the promise of significantly improving usability and user experience in intelligent applications. Such large-scale quantity of rich data has created an opportunity to dramatically advance the intelligence of machine learning models by catering powerful deep neural network models. Despite this opportunity, we know that such pervasive devices can capture much data about the user, such as what she does, what she sees, and even where she goes [1], where most of these data contain sensitive that a user may deem private.
To respond to concerns about the sensitivity of user data in terms of data privacy and security [2–4], in recent years initiatives have been made by governments to prioritize and improve the security and confidentiality of user data. For instance, in 2018, the General Data Protection Regulation (GDPR) was enforced by the European Union to protect individuals’ privacy and data security. These issues and regulations pose a new challenge to traditional AI models where one party is involved in collecting, processing, and transferring all data to other parties. It is easy to foresee the risks and responsibilities involved in storing/processing such sensitive data in the traditional centralized AI fashion.
Federated learning is an approach recently proposed by Google [5–7] to train a global machine learning model from a massive amount of data, which is distributed on the client devices such as personal mobile phones and IoT devices. It is noteworthy that FL differs from traditional distributed learning since we assume that training data — which is supposed to be sensitive — is maintained on the users’ devices they were generated on (e.g., data generated from users’ interaction with mobile applications, such as geographical data in location-aware applications [8]). Therefore, we have to deal with data that is quantitatively unbalanced and differently distributed over devices, i.e., each device data is not a representative sample of the overall distribution. Instead, in a traditional distributed setting, data has to be collected in a centralized location and then evenly spread over proprietary computing nodes. As a matter of fact, with FL, we leverage users’ computing power for training a shared ML model while preserving privacy, by actually decoupling the ability to learn a ML model from the need to store private data centrally.
In principle, a FL model is able to deal with fundamental issues related to privacy, ownership and locality of data [9]. In [7], authors introduced the FederatedAveraging (FedAvg) algorithm, which combines local stochastic gradient descent on each client via a central server that performs model aggregation by averaging the values of local parameters. To ensure that the developments made in FL scenarios uphold to real-world assumptions, in [10] the authors introduced LEAF, a modular benchmarking framework supplying developers/researchers with an abundant number of resources including open-source federated datasets, an evaluation framework, and several reference implementations.
Despite its potentially disruptive contribution, we argue that FedAvg has several significant shortcomings. First, the aggregation operation in FedAvg assigns the contribution of each agent proportional to each client’s local dataset size. This simplifying assumption ignores a wealth of other qualitative measures that can be impactful for training an effective global model. Examples of such measures include the number of sample classes held by each agent, the divergence of each computed local model from the global model — which may be critical for convergence [11] —, estimations about the agent computing and connection capabilities and finally the amount of client’s honesty and trustworthiness.
While FedAvg only uses limited knowledge about local data, we argue that the integration of the aforementioned qualitative measures and the expert’s domain knowledge is fundamental for increasing the global model’s quality. As a toy example, let us consider a federated scenario with just two users Alice and Bob. The photos in users’ mobile phones are the training samples of a ML model for classifying clothes. Let us suppose that Alice holds tens of very similar photos with the same outfit, and most of them are blurred. Thus these images do not carry much information. Instead, Bob holds a smaller number of well-labeled photos, but high-quality, and with a lot of different clothes. We argue that in such a situation, the weight of Bob’s contribution to the ML model should be higher or comparable to Alice’s.
The work at hand considerably extends the FedAvg approach [7] by building on two main assumptions: we can substantially improve the quality of the global model by incorporating a set of criteria measuring some quality properties of the clients, and properly assigning the contribution of individual update in the final model based on these criteria; the introduced criteria can be combined by using a score function able to “summarize” them; toward this goal, we assert about the potential benefits of using a prioritized multi-criteria score function over the identified set of criteria.
Moreover, the FL system proposed in this work moves one step forward from our previous effort in [12] along the following directions: (i) a formal notion of criterion is proposed and a formal classification for possible criteria is summarized (cf. Section 4.2), (ii) a mathematical formalization for score functions is described (cf. Section 4.3), (iii) the system is evaluated through more comprehensive experiments, in particular, we compare our approach against FedAvg both on MNIST dataset (with IID and Non-IID distribution), used in [7], and on CelebA dataset, where a classification task is performed by also injecting more explicit information about the quality of the local dataset and local model.
The remainder of the paper is structured as follows. In Section 2 we review some related work, while Section 3 is devoted to introduce the standard FL model and its key concepts. In Section 4 we provide a formal description of the proposed FL approach and we formalize both the concepts of local criteria and score function, while providing some general concepts for their choice. Section 5 details the experimental setup, as well as the datasets, the tasks and the metrics we consider for validating our approach and studying its variables. Section 6 presents results and discussion. Finally, Section 7 concludes the paper and discusses future perspectives.
Related work
Nowadays, any AI application requires capturing a considerable amount of users’ sensitive data about their routine activities. Moreover, users may be unaware of how their data are collected, processed, and transferred between parties [13]. Federated learning aims to meet the privacy ML shortcomings by horizontally distributing the training of the model over user devices, thus making clients exploit private data without sharing them [7].
Federated Learning ties in with other privacy-preserving ML approaches, aiming to equip ML algorithms with defense measures for protecting user privacy and data security. Beyond the recent FL paradigm, the field of privacy preservation in ML has been thoroughly investigated through the lens of (i) Secure Multi-Party Computation [14], (ii) Homomorphic Encryption [15], and (iii) Differential Privacy [16, 17], which enable different types of defense against privacy threats in ML.
Figure 1 highlights the differences between different learning approaches, including Federated Learning. Four architectures are compared, focusing on the information exchanged over the network: the solid lines represent training data flow, the dashed lines represent model parameters flow. In detail, we distinguish between the following learning paradigms from a distribution point-of-view:

Information flow over the network in four ML architectures. Solid lines represent training data flow, dash lines represent model parameters flow.
It is interesting to notice how FL was originally conceived as a cross-device architecture involving users’ mobile devices and edge device applications, e.g., for Google’s Gboard mobile keyboard [18]. Nonetheless, FL also extended towards cross-silo architectures [19], for multiple organizations willing to train a model in inherently sensitive domains (e.g., financial or medical). For example in healthcare, regulations prohibit medical institutions from sharing medical data (e.g., to improve medical imaging performance [20]); however, FL makes centralization unnecessary, by allowing data to remain isolated in the individual organizations while still improving medical AI models [3].
While handling users’ privacy concerns, FL faces challenges such as communication costs, unbalanced data distribution and device reliability and security issues (e.g., model poisoning [21], indirect information leakage [22], and Byzantine adversaries [23]). Despite its original formulation, in [13] federated learning concept is extended to a more comprehensive idea of privacy-preserving decentralized collaborative ML techniques, both for horizontal federations (where different datasets share the same feature space but are different in training samples) and vertical federations (where different datasets share the training samples but they differ in feature space).
In literature, a growing number of works focus on some modifications of FedAvg algorithm, e.g., adding a regularization term in the local objective functions for controlling the convergence of the network with statistical heterogeneity of data [11], changing the optimization algorithm [24]. Some other works focus on some adaptations for controlling the averaging operation on a per-layer basis [25] or for obtaining more personalization on the local models [26]. Finally, the work in [27] studies a protocol for selecting devices based on their dataset quality without any information disclosure.
The approach we propose is configured as a formal tool for incorporating any of the above-mentioned objectives (for example, facing the statistical challenges), when these could be obtained by pushing up or down the contribution of each client, based on some requirements. Such mechanism is completely incorporated in the weight of each client in the global aggregation. To the best of our knowledge, this is the only work attempting to modify the value of the weight of clients in the aggregation step.
In recent years, Federated Learning (FL) have been proposed by Google [6, 28] as a privacy-preserving architecture for training a ML model, by using private data from a large set of users who will never share them with other parties in the system. The motivation behind FL is mainly efficiency and privacy [7, 29]. Indeed, since users’ data never leaves the devices, FL models can be trained on private and sensitive data, e.g., the history of user typed messages, which can be considerably different from publicly accessible datasets.
Although FL principles can be applied to many machine learning tasks, throughout this paper we consider a fundamental problem in ML, classification task. Assume a training dataset
In the federated learning (FL) setting, there are two main players:
In the following, we briefly review the principles of Federated Learning by means of a formal definition, based on the scenario presented so far.
There exists an alternative for this formulation, in which the client a updates the received global model Θ and calculates a local version Θ
a
according to Eq. 3 and then it communicates the updated local model Θ
a
(instead of the local gradients) to S:
A well-known implementation of FL is FedAvg [7], in which the authors use the FL procedure to train a deep learning model using SGD for a classification task.
The FL principles discussed in Section 3 prevent the system from collecting information about users, giving rise to a natural trade-off between users’ data privacy and performance of the system. Our assumption is that revealing some non-sensitive client-related information (shared by each client a) and integrating this knowledge in the global aggregation step represented by Eq. 4 (performed by the server S) could lead to learning more effective federated model without harming users’ privacy. For instance, such non-sensitive data may carry useful information about specific domain or some marketing objectives that can be leveraged to build more in-domain strategies or increase the system profitability.
The proposed solution envisioned in this work is a client-aware FL strategy based on the following elements: we introduce the notion of criterion, which measures a specific property about users and their data, and we propose a formal classification for them; we propose the use of a score function to “summarize” the criterion measurements and compute a score for each client of the federation to be encoded in the aggregation step of each round of communication; we explore the benefits of a prioritized multi-criteria score function over the identified set of criteria;
In the following, we introduce the fundamentals and the formalism behind our approach. Table 1 summarizes the notation used throughout the paper.
Definition of variables used in this work
Definition of variables used in this work
Let us assume
The main idea of the work at hand is that we can encode the knowledge about clients in the weights w
a
used for aggregating client contributions in Eq. 2. Based on that, S can compute the weight w
a
for agent a according to the following equation:
In the following, we detail the main dimensions of the study related to the proposed approach, which include: identification of the set identification of a score function able to “summarize” the computed measurements in a score value representing each agent.
To properly study each of the presented dimensions, in the following we start discussing about the identification of criteria, then we go along the study of score functions and, in particular, priority-based score functions.
In the original FedAvg formulation, the server performs aggregation to compute w a , without knowing any information about participating clients, except for a pure quantitative measure about local dataset size.
However, a federated setting has to face some key issues [7] related to communication, connectivity and statistics. Among them, if we focus on the statistical aspects about the data, we notice that training data — which is typically the result of the real user usage of the device — represent a non-IID distribution of the training data over the large number of clients. Moreover, the amount of data is also unbalanced across the clients. Such characteristic may affect the accuracy and the efficiency of the resulting aggregated model [30]. In this scenario, a service provider may be interested in defining some statistical criteria such that the rounds of communication needed to reach the desired target accuracy are minimized. This result can be accomplished by enhancing the contribution of clients fulfilling the requirements expressed by the defined criteria.
For instance, useful information could be related, in a classification problem, to the number of classes covered by a local dataset. Moreover, a domain expert could ask users to measure their adherence to some other target properties (e.g. their nationality, gender, age, job, behavioral characteristics, etc.), in order to build a global model emphasizing the contribution of some classes of users; in this way, the domain expert may, in principle, build a model favoring some targeted commercial purposes.
We identified four classes of local criteria, each of them related to different aspects of the local dataset criteria describing the quality of the local dataset criteria describing the quality of the produced local model criteria describing the trustworthiness ofdevice a; criteria capturing the fulfillment of user a with respect to commercial targets (e.g., gender, job, nationality, etc.).
Although we can identify some classes of criteria, the choice for each particular criterion still remain a strictly task- and domain-dependent activity. We provide some insights about how to define such criteria: first, as a rule of thumb, one should start from the objective, e.g., obtaining faster convergence, overshadow unreliable updates, specialize the model with respect to some categories of users and then choose characteristics which can improve the model towards that initial objective. Next, the system designer could play with the identified criteria to test their effectiveness, e.g. with some previous centralized data, or with some synthetic data, or with some validation rounds of federated training. Finally, the empirical evaluations performed in this work suggest that choosing criteria that lead to higher variance in the score obtained across clients results in a better final model. For instance, the criterion dataset size is not an appropriate criterion, if all the clients have the similar number of samples in their private datasets.
For this reason, in this work we do not focus on a particular choice for them, but rather on presenting a formal approach about dealing with them. Moreover, we show in the experimental section — for illustrative purposes only — some examples of how they can be chosen with respect to the task.
Identification of a score function
A crucial aspect of this work is related to the identification of a score function f
m
able to “summarize” the values ci,a for each criterion
Over the years, a wide range of aggregation functions have been proposed in the field of information retrieval (IR) [31]. Just to mention the most relevant ones, we can refer to the weighted averaging operator, as well as to the ordered weighted averaging (OWA) models [32, 33] — which extend the binary logic of AND and OR operators, allowing representation of intermediate quantifiers —, to the Choquet-based models [34–36] — which are able to interpret positive and negative interactions between criteria —, and finally to the priority-based models [37].
Although we have many opportunities, we focus on the priority-based score function proposed in [37]. Such function can be formulated with a compactnotation as:
The main aim of such function, in a multi-criteria scenario, is to use a priority order over the involved criteria in order to assign each agent a a score based on the measurements ci,a, for each
The properties presented so far make this function our main choice for our approach. As an example, we may consider the case where the domain expert may wish to consider extremely important the age of a user rather than its dataset size, so that even a large local dataset would be penalized if the user age criteria is not satisfied.
In the following, we assume, for the sake of clarity of notation, that the set
In this section we describe the experimental setup — datasets, tasks and identified local criteria — used to validate the performance of the proposed FL system. In detail, we validate our approach on the following datasets and tasks:
For each dataset we consider different data distributions, as well as a specific model. Moreover, for each of them we define different criteria based on possible useful information we may extract.
MNIST experiments
We run a first cluster of experiments on MNIST dataset [38] for the digit recognition task. This dataset contains 60,000 examples of 10 classes of handwritten digits (plus a test set of 10,000 examples) by 500 writers. The samples are available in the form of 28x28 pixel black and white images.
We use this dataset in a completely user-agnostic way. Therefore, we created the federated dataset by following the two partitioning ways used in [7], in order to simulate both a IID distribution and a Non-IID distribution of data over the different clients.
Model. For the sake of comparability, we built the same classifier described in [7], i.e., a CNN with two convolution layers with 5x5 filters — the first layer with 32 channels, the second with 64, each followed with a 2x2 max pooling layer —, a fully connected layer with 512 units and ReLu activation, and a final softmax output layer with 10 neurons, for a total of 1,663,370 totalparameters.
Local criteria. For this experimental setting, we aim at both reducing the number of rounds of communication necessary to reach a target accuracy and making the global model not diverging towards local specializations and overfittings. To this aim, we extend the pure quantitative criterion in FedAvg [7] — dataset size — by leveraging two new criteria 2 .
The first criterion we consider is the one already used by FedAvg [7], namely the local dataset size (
The second considered criterion is the diversity of labels (
With respect to the third criterion, our aim is to reduce the negative effects of a non-IID distribution. Indeed, with non-IID distributions — and this is the case of our dataset — model performance dramatically gets worse [30]. Moreover, a large number of local training epochs may lead each device to move further away from the initial global model, towards the opposite of the global objective [11]. Therefore, a possible solution inspired by [11] is to limit these negative effects, by penalizing higher divergences and highlighting local models that are not very far from the received global model. We evaluate the local model divergence (
CelebA experiments
CelebFaces Attributes Dataset (CelebA) [39] is a large-scale dataset with 202,599 face RGB images of 10,177 celebrities. Each image comes with 40 binary attributes and differs from the other ones for celebrity pose and background. The task is a binary classification between smiling and non-smiling people, which is an information included within the 40 attributes. We chose this task since the smiling attribute has a good balance in the whole dataset between positive and negative outcomes.
We use this dataset in a completely user-aware way. Indeed, we suppose that each celebrity holds her own photos in her mobile phone gallery. This represent a realistic set of local datasets, which could have been generated from the personal device usage. This distribution is inherently Non-IID, therefore it is a representative scenario for a federated setting.
We used a subsampled version of the dataset (50% of images). Then, we removed users with less than 5 photos. For each user, we split her private dataset with random hold-out method with a ratio of 80/20. Finally, images have been resized to 64x64 pixels.
Model. For the CelebA experiments, we built a CNN binary classifier with two convolution layers with 3x3 filters — the first layer with 32 channels, the second with 64, each followed with a 2x2 max pooling layer —, a fully connected layer with 512 units and ReLu activation, and a final softmax output layer with 1 neuron, for a total of 8,409,025 total parameters.
Local criteria. Also in these experiments, our aim is to reduce the number of rounds of communication needed to reach a target accuracy. Three criteria have been used to reach such objective.
Also in this case, we keep the dataset size criterion (
Similarly to what has been done with the MNIST dataset, we want to consider how distributed are the labels of the local samples. In fact, since we are dealing with a binary classification task, we have at most two different classes in each dataset. For this reason, we consider a measure of class balance (
Then, the second criterion has been defined as
As mentioned above, this dataset comes with a set of binary attributes describing each image. Among them, we consider the blurriness as a non-sensitive information, so that the percentage of sharp images within the private dataset can be shared with the server. Therefore, we define
Implementation. We implemented our framework using Tensorflow [40] interface for Python. The operations and the communication between the participants of the federation (one server S and
Hyperparameters. We set the hyperparameters for the whole set of our experiments as follows, also guided by the results obtained in [7]. As for the FedAvg client fraction parameter, in each round of communication only 10% of clients are selected to perform the computation, so that
Evaluation. The classification model we built have been evaluated with respect to classification accuracy, i.e.,
This metric has been computed on each device over the private local test set. Based on LEAF framework [10] — which provides reproducible reference implementations and datasets for FL, as well as system and statistical metrics —, we estimate a global accuracy by averaging local accuracy values and weighting them based on local test set size.
Moreover, we improve the validation of the FL setting by using an approach which offers an overview of the whole training performances, instead of metrics describing a single round of communication. More specifically, we measure the number of round of communication required to allow a certain percentage of devices, which participate to the federation process, to reach a target accuracy (e.g., 75% or 80%), since this measurement is able to fairly show how effective and efficient is the model across thedevices.
In this section, we show and discuss the results of the experiments by considering both the evaluation approaches previously mentioned.
Tables 2, 4, and 6 show in the gray cells the number of rounds of communication required to the baseline (only DS) to allow certain fractions of devices to reach a specific percentage of the overall accuracy, for MNIST IID distributed, MNIST Non-IID distributed, and CelebA, respectively. In detail, as target accuracies, we have considered 70%, 80%, 90%, and 95% of prediction accuracy for MNIST datasets, and 70%, 80%, and 85% for CelebA dataset. The remaining rows of the tables show the gain in terms of rounds of communication of each experimental setting against the baseline.Therefore, we compute each row as the difference between the results of the dataset size criterion model, and the results of the corresponding model. The higher the positive value, the better is the approach compared to the baseline. Moreover, for each row and target accuracy, we show the average rounds of communication gained by considering all the fractions of devices.
For each target accuracy, we show how many training rounds are needed to grant that accuracy for 10% up to 90% of devices.
To have a comprehensive view of the effects of each criterion and their combinations, we consider the individual criteria and their combinations separately. Specifically, we first show the results of the experiments realized by exploiting a single criterion (i.e., in Eq. 6). Then, we show the results for the six priority permutations of the three criteria.
Moreover, we provide for each dataset the confusion matrices (Tables 3, 5, and ??) which summarize the classification outcomes (correctly classified vs. misclassified) of each experiment against the baseline (only DS criterion). They have been obtained by comparing the best outcome of each experiment in order to highlight the significance of results irrespectively of the accuracy alignment presented in Tables 2, 4, and 6.
Finally, we also show the results in terms of global test accuracy while considering the rounds of communication. In this respect, we have measured global accuracy as the average of the local test accuracy values weighted by the local test size.
In detail, Figures 2a, 2b, and 3, show these curves for MNIST IID distributed, MNIST Non-IID distributed, and CelebA, respectively.

Results for MNIST dataset. The plot provides the measure of test global accuracy during training for each round of communication (on horizontal axis).

Results for CelebA dataset. The plot provides the measure of test global accuracy during training for each round of communication (on horizontal axis).
In the following, we discuss the experiments by investigating the effects of the individual criteria.
MNIST with IID distribution. Table 2 and Figure 2a show the experimental results for MNIST considering IID distribution. The adoption of criteria that are different from the dataset size seems to have no effects, but some insignificant fluctuations. We expected this behavior since the dataset does not show statistical issues. In detail, the local datasets have been created by randomly picking samples from the original dataset. Consequently, we may represent the divergence of the local models from the original model as a zero-mean Gaussian noise. The same assumption holds for the label diversity. Furthermore, by the design of the dataset, DS is equal for every client, and we also expect that, on average, LD is the same for all clients. As a consequence, MW can not play a significant role, since all clients train the model with the same distribution of data, on average.
In detail, by looking at the differences between the two criteria LD and MW against the baseline DS, we may notice that the gain in terms of rounds of communication for 70%, 80%, and 90% target accuracy, considering all the percentages of clients, has an average very close to 0. Conversely, both LD and MW show slightly worse results than DS for a target accuracy of 95%, notably when we want to grant such accuracy to a very high number of devices (i.e., the last three columns of Table 2).
Lastly, the experiment shows that the effectiveness of criteria depends on the data distribution. The dataset shows how datasets built by randomly picking data from the original distribution do not generally need further statistical adjustments during the training phase. Therefore, we do not need to consider, in our approach, any particular statistical-based criteria. On the other hand, it is worth mentioning that the scenario is completely unrealistic in a federated scenario. Indeed, since each user privately generates data, they will not show the same distribution, and they will be significantly different in terms of statistical properties.
MNIST with Non-IID distribution. Regarding DS, LD, and MW, Table 2 shows that DS is the best criterion to reach the best overall system performance in a lower number of rounds. In detail, DS seems to be in trouble in the first rounds to achieve an acceptable degree of accuracy. In fact, if we look at 70% and 80% of the accuracy, LD and MW reach those goals faster. In this sense, we may notice an average of 1-2 rounds gained by LD and MW. However, if we need a higher degree of accuracy (90% and 95%), DS generally reaches that accuracy from 5 to 12 rounds before, on average. In this respect, we may notice that, without considering the last three columns of 95%, DS and LD show approximately the same performance, with an advantage of 0.5 rounds on average for LD. On the other hand, in the same scenario, MW is 3.6 rounds slower on average. Nonetheless, the remaining three last columns show an average advantage of DS of 12 and 16 rounds for LD and MW,respectively.
CelebA. Results in Table 6 show that the three criteria behave in a significantly different way. In this experiment, we note that CB performs much better than DS. To get an impression, we may notice CB reaches 85% of the overall accuracy for the 60% of the devices in less than half of the rounds of DS. Moreover, if we observe the CB row in Table 6, we may note that it reaches the same accuracy goals of DS much faster. In detail, CB reaches those goals 7.5 rounds before DS. On the other side, IS is generally slower than DS. Indeed, it generally needs 7.6 rounds more than DS. Another interesting insight is that the advantage of CB over DS progressively increases. In detail, CB shows an advantage of 8.6 rounds for the 70% of accuracy, 11.5 rounds for 80%, and 13 rounds for an accuracy of 85%.
Results for MNIST dataset with IID distribution for each experiment (the symbol ≻ denotes the priority relation) The row DS (baselines) provides the number of rounds of communication needed to make the percentage of devices specified in the columns reach the specified target accuracy. The remaining rows show the gain in the number of communication rounds with respect to the baseline (the greater the better). The column Avg. shows the average gain for the experiment with all the percentages of devices
Results for MNIST dataset with IID distribution for each experiment (the symbol ≻ denotes the priority relation) The row DS (baselines) provides the number of rounds of communication needed to make the percentage of devices specified in the columns reach the specified target accuracy. The remaining rows show the gain in the number of communication rounds with respect to the baseline (the greater the better). The column Avg. shows the average gain for the experiment with all the percentages of devices
Confusion Matrices for MNIST dataset with IID distribution for each experiment. For each matrix, the rows refer to the number of samples that are misclassified (✗), or correctly classified (✓). The columns denote the criteria that are compared with the baseline. The symbol ≻ denotes the priority relation
Results for MNIST dataset with Non-IID distribution for each experiment (the symbol ≻ denotes the priority relation). The row DS (baselines) provides the number of rounds of communication needed to make the percentage of devices specified in the columns reach the specified target accuracy. The remaining rows show the gain in the number of communication rounds with respect to the baseline (the greater the better). The column Avg. shows the average gain for the experiment with all the percentages of devices
Confusion Matrices for MNIST dataset with Non-IID distribution for each experiment. For each matrix, the rows refer to the number of samples that are misclassified (✗), or correctly classified (✓). The columns denote the criteria that are compared with the baseline. The symbol ≻ denotes the priority relation
Results for CelebA dataset for each experiment (the symbol ≻ denotes the priority relation). The row DS (baselines) provides the number of rounds of communication needed to make the percentage of devices specified in the columns reach the specified target accuracy. The remaining rows show the gain in the number of communication rounds with respect to the baseline (the greater the better). The column Avg. shows the average gain for the experiment with all the percentages of devices. Runs that did not reach the target accuracy for the specified percentage of devices in the 100 allowed rounds have been obtained considering such limit value
Confusion Matrices for CelebA dataset for each experiment. For each matrix, the rows refer to the number of samples that are misclassified (✗), or correctly classified (✓). The columns denote the criteria that are compared with the baseline. The symbol ≻ denotes the priority relation
In this section, we focus on the effects of the combination of the different criteria. Eventually, in Section 6.3, we discuss the different findings of the experiments, and we draw some general remarks.
MNIST with IID distribution. In the lower section, Table 2 shows the effects of the six priority permutations of the criteria. We may suppose that, since the individual criteria did not give any boost to the training, also their combination should have no effect. As an example, in some situations — and this is very clear in LD ≻ MW ≻ DS — we can observe a boost for lower target value of accuracy (70%, 80%, and 90%) and a slowdown in reaching the target accuracy of 95% for a high number of devices. Analogously to the individual criteria, we observe that the average gain is quite close to 0, for each target accuracy. The section 95% is an exception, since its last three columns show a slowdown up to 9 rounds of communication, on average. This behavior is probably due to the adverse effects we already observed for the individual criteria in this dataset.
We can also observe the same behavior in Figure 2a, where the curves of the combinations of criteria — and the ones of the individual criteria — are virtually indistinguishable.
MNIST with Non-IID distribution. Once we combine different criteria, we may observe several different behaviors. If we focus on Table 2, we can observe two different trends. First, if we analyze the data in the table moving from the left to the right, we may observe an increase of negative gains in the sections related to 90% and 95% of the accuracy. Even here, if we focus on the last three columns, we may notice an average delay that can reach up to 12 rounds. However, even in this case, there are some combinations that achieve an advantage over DS: LD ≻ DS ≻ MW, MW ≻ DS ≻ LD, and MW ≻ LD ≻ DS. LD ≻ DS ≻ MW shows an interesting trend since it reaches 70%, 80%, 90%, 95% of accuracy 2, 2, 0, 1 rounds sooner than DS, respectively. Here, the advantage is higher for the lower accuracy thresholds, while it is almost 1 for 95% accuracy. Instead, MW ≻ DS ≻ LD shows a better and incremental trend. Indeed, in this case, the advantage of rounds is -0.1, 0.2, 2.1, and 1.7, respectively. Finally, MW ≻ LD ≻ DS shows a bad advantage average for 70%, and 80% (0.2, and -0.6), while for 90%, and 95% it shows an interesting performance boost by reaching the goals 3.1, and 4.7 rounds before DS.
CelebA. Results in Table 6 show results which are coherent with the effects of the individual criteria. Rows DS ≻ IS ≻ CB, as well as IS ≻ DS ≻ CB, IS ≻ CB ≻ DS, and CB ≻ IS ≻ DS, denote the negative influence of the IS criterion. In particular, they show a little boost in very few cases for a low percentages of devices, but their averages generally show a slowdown from 2 up to 13 rounds of communication. On the other side, we may observe the effects of BC, which is beneficial even when combined with the other criteria. Even then, this effect is more evident when IS has the last priority (i.e., by minimizing its influence). In this case, we observe a substantial speed-up against the sole DS criterion, and also when compared with the CB individual criterion. In detail, permutation DS ≻ CB ≻ IS shows the best results for all target values of accuracy and all values of coverage for devices, up to 12 rounds of communications (against the sole DS criterion). Even Figure 3 clearly shows these behaviors. Specifically, we may observe the cyan line (DS ≻ CB ≻ IS) against the red line (DS only) and the blue line(CB only).
Discussion
The analysis in the previous section already shows the fundamental role of the identification of local criteria. By considering the underlying dataset, the different criteria heavily affect the training phase outcome.
As an example, when we create a federated dataset by randomly picking samples from an original dataset (i.e., with an IID distribution), the adoption of statistical criteria is not beneficial, since data, on local devices, shows the same expectation.
On the other hand, even in MNIST Non-IID distributed dataset, it is quite disappointing that we can not appreciate the benefits of introducing new criteria like label diversity. However, if we observe the dataset deeper, it contains local shards of the same size, and each client processes, at most, samples of two digits that make the number of classes in local datasets a piece of useless information.
Instead, the criterion on model divergence gives an initial boost to training, but it seems to have adverse effects for higher target values of accuracy. Even here, the behavior is interesting and predictable. Indeed, the penalization of significant divergences is the right choice because it helps to build a robust estimator on the average of the samples. However, at the end of the training, higher precision and fine training could require those differences.
As expected, when we observe the most realistic dataset, CelebA, we may appreciate more the effects of the proposed approach. In fact, it is a realistic Non-IID distribution of images where each client holds samples which are different in number and number of classes. Consequently, the attribute about class balancing gives the best results either when considered individually and also combined with the dataset size attribute. This is probably due to the differences both in dataset size and class balancing among the local datasets. Conversely, Sharpness is a theoretically useful attribute, but it results in a slowdown of performance during the training phase. We believe that the reason for this behavior is twofold. First, only a few images are marked as blurry in the dataset. Second, a right combination of sharp and blurry images has widely proven to be beneficial for the generalization of a CNN.
Conclusions and future perspectives
In this work, we have introduced a practical Federated Learning (FL) protocol that exploits non-sensitive client information to aggregate the local models. In detail, we have formalized the notion of a local criterion for clients in an FL scenario, and the notion of priority ranked criteria. By considering the ranking of the criteria, we have defined the score functions to weigh the contribution of a client. We have tested our approach on three well known federated learning datasets: IID-distributed MNIST, Non-IID-distributed MNIST, and CelebA.
The experiments show that we can substantially improve the standard federated learning approach by exploiting a properly defined set of local criteria. The approach is particularly effective when dealing with Non-IID distributions of data. This is remarkable since a real federated scenario will show a Non-IID distribution of data, rather than an IID distribution, where differences among clients are not so perceptible. In general, substantial differences among devices about a property make the corresponding criterion more effective. In practical terms, we have improved the federated approach by enhancing the individual differences, respecting the true federated learning spirit. In the experiments, for each dataset, we have investigated three different criteria that consider local datasets and local models.
However, there is still a broad spectrum of criteria that deserves to be explored. On the other hand, from the experiments, it also emerges that a criterion’s efficacy highly depends on the specific dataset/domain. Even though several considerations may lead the researcher, it is not possible to find a unique criterion that meets the needs of all the possible domains. Nevertheless, the study of the individual criteria has revealed some information about their impact on the training phase, which we hope may be useful for other researchers.
Next, we have observed how some criteria may be useful in some moments of the training, but they also may cause issues in others. In this respect, we propose a self-adaptive model that re-ranks the priority of criteria as partially investigated in [12]. We are particularly interested in this research direction and in the chance of leveraging this federated approach to other machine learning scenarios, like Recommender Systems [41, 42] and Deep Learning. Indeed, in those research fields, local model specialization and privacy are crucial, and this prioritized federated learning approach may be particularly beneficial.
Finally, currently the focus of our work has been to protect privacy of users and provide satisfactory predictions. In doing this, as our proposed system relies on a number of local criteria obtained from users, using such criteria may lead to some unfair or biased recommendation toward particular class of users (e.g., men v.s. women) or particular class of items (paid v.s. non-paid jobs). The reason is we do not control over how to use these features to protect user fairness and algorithmic biases. We deem this an important open research direction for which we plan to employ fairness-aware metrics as part of our approach [43, 44].
Footnotes
An agent represents a computational node of the system, usually corresponding to the private device of an user, which has the role of a client in the architecture. For this reason, throughout the paper, we will use the three terms clients, agents and users interchangeably.
Please note that we are not stating that the proposed ones are the only possible criteria. We present them just to show how the introduction of new information may lead to a better final model.
Acknowledgements
The authors wish to thank Domenico Siciliani for helping with the implementation of the framework. The authors acknowledge partial support from Innonetwork Apollon, Innonetwork CONTACT, Exprivia Digital Future, PON OK-INSAID, PON FLET.
