Decentralized collaborative machine learning for protecting electricity data

Abstract

In recent years, there has been a noticeable surge in electric power load due to economic development and improved living standards. The growing need for smart power solutions, such as leveraging user electricity data to forecast power peaks and utilizing power data statistics to enhance end-user services, has been on the rise. However, the misuse and unauthorized access of data have prompted stringent regulations to safeguard data integrity. This paper presents a novel decentralized collaborative machine learning framework aimed at predicting peak power loads while protecting the privacy of users’ power data. In this scheme, multiple users engage in collaborative machine learning training within a peer-to-peer network free from a centralized server, with the objective of predicting peak power loads without compromising users’ local data privacy. The proposed approach leverages blockchain technology and advanced cryptographic techniques, including multi-key homomorphic encryption and consistent hashing. Key contributions of this framework include the development of a secure dual-aggregate node aggregation algorithm and the establishment of a verifiable process within a decentralized architecture. Experimental validation has been conducted to assess the feasibility and effectiveness of the proposed scheme, demonstrating its potential to address the challenges associated with predicting peak power loads securely and preserving user data privacy.

Keywords

Electricity data, privacy protection, decentralized machine learning

1. Introduction

As the economy flourishes and the income levels of residents rise, there is a discernible increase in the share of electric energy within the realm of final energy consumption. Concurrently, the demand for electric power load is experiencing swift growth. The fundamental mission of a power system is to ensure a reliable supply of uninterrupted, superior-quality, and stable electric power to a diverse range of users. Within this framework, the precision of power load supply decisions is paramount, with power load forecasting being the cornerstone of these decisions [12].

Power load forecasting technology is a critical component within the power industry, playing an essential role in its development and day-to-day operations. Precise forecasting of power loads enables efficient predictions of future load levels, offering vital insights for the development of power system scheduling plans. This, in turn, contributes to the economic operation of the power system [17].

At present, forecasting technologies are predominantly implemented through machine learning algorithms. These algorithms analyze the relationship between historical power data features and load values to predict forthcoming loads [18]. However, data security emerges as a significant concern during this process. With the advent of big data and Internet technologies, an increasing number of consumers are employing smart devices, such as smart meters, to monitor their electricity usage. Regulators also use these devices to manage power loads. Nevertheless, this practice can lead to the risk of data leakage and privacy infringement. Malicious actors can potentially deduce sensitive information, such as the number of occupants in a household, their commuting patterns, and daily routines, by analyzing peak power usage data. This poses a substantial threat to individual privacy and societal stability [2]. Moreover, the enactment of national and international legislation aimed at safeguarding data security and personal privacy, such as the Data Security Law of the People’s Republic of China and the European General Data Protection Regulation (GDPR), imposes stringent policy and legal constraints on the collection and utilization of data [6].

Therefore, determining how to protect user electricity data while enabling power companies to forecast electricity loads is one of the current problems to be solved in the power industry.

Specifically, the use of user electricity data to predict electricity load faces the following problems:

With the awakening of privacy awareness and legal and regulatory constraints, users who generate electricity data are reluctant to transmit their data directly and do not allow their data to be inferred.

Smart electrical devices such as smart meters generate a large amount of data during their operation, but they do not possess the capability for data processing and storage. The common approach is to transmit this data to a central server for further processing or storage. However, as these data are generated in real-time and in large volumes, direct transmission can exert significant pressure on network bandwidth and the computational resources of the central server.

The server processing the data may be hacked or hijacked, posing a threat to the security of the data. In addition, a single server is prone to a single point of failure. Once the server fails, the entire system will come to a standstill, causing serious losses [13].

Researchers have studied privacy and security issues in machine learning processes. For example, secure multi-party computation has been used to process data in an encrypted state, in which intermediate data are invisible to all but the specific participants entitled to certain results [20]. This approach ensures privacy and security during data processing but incurs a large communication overhead. Moreover, in practical applications such as the utilization of power data, the data are generated in real time and in large quantities while the computational power of the devices is limited, so this approach has obvious limitations. Another representative scheme is federated learning [11], in which the data never leave the client. Each client trains on its limited local data, obtains a local model, and uploads the parameter gradients or model parameters to a centralized server. The centralized server then generates a global model by aggregating all the uploaded local models or gradients. Because the raw data stay local, this approach intuitively protects individual privacy. The architecture of federated learning is shown in Fig. 1.

Fig. 1.

Federated learning architecture.
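The server-side aggregation step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function name `federated_average` is hypothetical, and weighting each client by its local sample count (as in FedAvg) is an assumption, since the text only states that the uploaded local models are aggregated.

```python
from typing import List, Sequence


def federated_average(local_models: Sequence[Sequence[float]],
                      sample_counts: Sequence[int]) -> List[float]:
    """Weighted average of client parameter vectors.

    Assumption: each client is weighted by its number of local samples.
    """
    if not local_models or len(local_models) != len(sample_counts):
        raise ValueError("need exactly one sample count per local model")
    total = sum(sample_counts)
    dim = len(local_models[0])
    global_model = [0.0] * dim
    for params, count in zip(local_models, sample_counts):
        for k in range(dim):
            global_model[k] += params[k] * count / total
    return global_model


# Two clients with two parameters each; the second client holds 3x more data.
clients = [[1.0, 2.0], [3.0, 4.0]]
global_model = federated_average(clients, sample_counts=[1, 3])
```

Only the parameter vectors reach the server in this sketch; the raw data never appear in the aggregation.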

In power data scenarios, federated learning is a preferred option because the data sources are decentralized. However, the following problems still exist:

Some researchers have stated that during the federated learning process, a malicious attacker can compute part of the raw data by observing the parameters of the local model, i.e., $w_{(t + 1)}^{1}$ in Fig. 1. Strict privacy protection therefore requires that the local gradient generated by each client remain invisible throughout the federated learning process.

The trustworthiness of the federated learning server cannot be ensured, and a risk of a single point of failure exists.

In power scenarios, the terminals are devices such as smart meters, which have limited computing and storage capabilities and are unable to perform local training tasks [19].

To address these issues, researchers have tried to combine blockchain, secure multi-party computation, and edge computing. Kumar et al. [10] first used blockchain to validate the data and then utilized federated learning to train a deep learning model globally to improve recognition accuracy on CT images of COVID-19 patients. Qi et al. [16] used a blockchain-based federated learning framework to predict traffic and added noise to the model to strengthen privacy guarantees, with the model being verified by miners. This scheme can effectively prevent poisoning attacks, but the model validity is somewhat affected. Edge computing is also commonly used in cutting-edge machine learning research, where edge nodes take over computational and storage tasks from a central server and thereby improve training efficiency. Khelifi et al. [9] explored the applicability of deep learning models (i.e., convolutional neural networks, recurrent neural networks, and reinforcement learning) to IoT devices. The study sought to assess the future trends of deep learning and edge computing. The findings indicated that convolutional neural models can be used in the IoT domain and that reliable machine learning models can be trained even with data from complex environments.

2. Method

The aim of this study is to design a privacy-preserving, decentralized, and efficient training scheme for power load forecasting. The architecture of this scheme is shown in Fig. 2.

Fig. 2.

Scheme overview.

In Fig. 2, ① refers to the appliances inside the home, such as refrigerators, televisions, and water heaters; ② refers to the smart meter, which collects the power data generated by the appliances; ③ refers to the edge computing device connected to the meter [4] (in this scenario, the home PC is assumed to act as the edge computing device and performs local training); and ④ refers to the blockchain [22], where the edge computing devices act as nodes in a peer-to-peer network and collaborate to aggregate the models through smart contracts [21]. The blockchain is synchronized and maintained through a consensus algorithm.

On the basis of the data flow, the flow of the scheme is described as follows:

User ${User}_{i}$ , $i \in \{1, 2, \dots, n\}$ uses household appliances to generate electricity consumption data ${Data}_{i}$ .

The smart meter collects electricity usage data ${Data}_{i}$ .

Edge computing node ${Node}_{i}$ uses ${Data}_{i}$ to train local model $M_{i}$ .

All edge computing nodes elect two aggregation nodes ${Agg}_{A}$ and ${Agg}_{B}$ by consistent hashing.

${Node}_{i}$ selects part of the gradient to encrypt and transmit to ${Agg}_{A}$ .

${Agg}_{A}$ and ${Agg}_{B}$ update the global model via a secure aggregation algorithm.

${Agg}_{A}$ uploads the updated global model and necessary verification information to the blockchain for all nodes to download and verify.

2.1. Aggregation node election

A hashing algorithm maps a given input to a fixed-length string (also called a message digest): the same input always generates the same output, and different inputs yield different outputs with high probability. However, the input cannot feasibly be recovered from the output. The consistent hashing algorithm maps all possible hash values onto an abstract ring, where each point on the ring represents a hash value [8].

In this scheme, the nodes in the blockchain are uniformly mapped to the ring, with the position of ${Node}_{i}$ on the ring given by $Hash ({Node}_{i})$ . The latest block in the blockchain is taken, and a hash operation is performed on it to obtain $Hash (Block)$ . The adjacent nodes ${Node}_{i - 1}$ and ${Node}_{i}$ are then located such that $Hash ({Node}_{i - 1}) ⩽ Hash (Block) ⩽ Hash ({Node}_{i})$ , and ${Node}_{i}$ is selected as ${Agg}_{A}$ . Similarly, another hash operation is performed on $Hash (Block)$ to obtain $Hash (Hash (Block))$ ; the nodes satisfying $Hash ({Node}_{j - 1}) ⩽ Hash (Hash (Block)) ⩽ Hash ({Node}_{j})$ are found, and ${Node}_{j}$ is chosen as ${Agg}_{B}$ . Because the hash value of the latest block is not fixed and cannot be predicted in advance, the two aggregation nodes can be regarded as randomly chosen, which ensures the safety and reliability of the selection process. The election process is shown in Fig. 3, and the corresponding algorithm is presented in Algorithm 1.

Fig. 3.

Aggregation node election.

Algorithm 1

Aggregation nodes election
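The election described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's Algorithm 1: SHA-256 is assumed as the hash function, the identifiers `ring_hash` and `elect_aggregators` are hypothetical, and wrap-around to the first node on the ring is assumed when the block hash exceeds every node position.

```python
import hashlib
from bisect import bisect_left
from typing import Iterable, Tuple


def ring_hash(data: bytes) -> int:
    """Position on the ring: SHA-256 digest read as an integer (assumed hash)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def elect_aggregators(node_ids: Iterable[str], latest_block: bytes) -> Tuple[str, str]:
    """Return (Agg_A, Agg_B) chosen from Hash(Block) and Hash(Hash(Block))."""
    ring = sorted((ring_hash(node.encode()), node) for node in node_ids)
    positions = [pos for pos, _ in ring]

    def successor(point: int) -> str:
        # First node whose position is >= point; wrap to the start of the ring.
        idx = bisect_left(positions, point) % len(ring)
        return ring[idx][1]

    block_digest = hashlib.sha256(latest_block).digest()
    agg_a = successor(int.from_bytes(block_digest, "big"))  # Hash(Block)
    agg_b = successor(ring_hash(block_digest))              # Hash(Hash(Block))
    return agg_a, agg_b


nodes = [f"node-{i}" for i in range(10)]
agg_a, agg_b = elect_aggregators(nodes, b"serialized latest block")
```

Every node can run this locally on the same chain state and obtain the same pair. In this sketch the two hashes may select the same node; how such a collision is resolved is not specified in the text and is left open here.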

2.2. Secure aggregation process

The technique used in the fifth step is homomorphic encryption with double trapdoors. A traditional homomorphic encryption scheme allows operations to be performed directly on encrypted data, so that decrypting the result of the computation yields the same result as the corresponding plaintext computation [3]. That is, ${Enc}_{P k_{i}} (m_{1}) ⊙ {Enc}_{P k_{i}} (m_{2}) = {Enc}_{P k_{i}} (m_{1} ⊙ m_{2})$ . Such a scheme requires that the plaintexts involved in the computation be encrypted with the same public key $P k_{i}$ , and the computation result can only be decrypted by the corresponding private key $S k_{i}$ . Our scheme instead encrypts the model gradient by using a double-trapdoor public key cryptosystem [14]. It allows each user to hold a different key pair, and ciphertexts encrypted with different public keys can be computed on jointly. That is, ${Enc}_{P k_{i}} (m_{i}) ⊙ {Enc}_{P k_{j}} (m_{j}) = Enc (m_{i} ⊙ m_{j}) = C$ and ${Dec}_{S k_{i} + S k_{j}} (C) = m_{i} ⊙ m_{j}$ . The cryptosystem also has a strong trapdoor: a strong private key constructed from it can decrypt every ciphertext, regardless of which user's (weak) key pair the ciphertext was encrypted for. That is, ${Dec}_{S k_{strong}} (C) = m_{i} ⊙ m_{j}$ .
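The single-key homomorphic property can be illustrated with a toy Paillier cryptosystem, in which multiplying two ciphertexts yields an encryption of the sum of the plaintexts. Paillier is used here only as a stand-in: it is not the double-trapdoor cryptosystem of [14], it has no multi-key or strong-trapdoor feature, and the tiny primes below are an assumption chosen for readability and are insecure.

```python
import math
import secrets

# Toy key (assumption): tiny primes for readability, insecure in practice.
P_PRIME, Q_PRIME = 61, 53
N = P_PRIME * Q_PRIME                 # public modulus
N_SQ = N * N
PHI = (P_PRIME - 1) * (Q_PRIME - 1)   # private key
MU = pow(PHI, -1, N)                  # modular inverse (Python 3.8+)


def encrypt(m: int) -> int:
    """Enc(m) = (1 + N)^m * r^N mod N^2 for a random r coprime to N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c: int) -> int:
    """Dec(c) = L(c^PHI mod N^2) * MU mod N, with L(x) = (x - 1) // N."""
    u = pow(c, PHI, N_SQ)
    return ((u - 1) // N * MU) % N


# Multiplying ciphertexts adds the underlying plaintexts.
c_sum = (encrypt(12) * encrypt(30)) % N_SQ
plain_sum = decrypt(c_sum)
```

Here the operation on ciphertexts is multiplication modulo N^2 and the operation on plaintexts is addition modulo N, which is one concrete instance of the ⊙ relation above.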

In our scheme, ${Node}_{i}$ first divides all of its local gradients into N blocks, selects the $j^{th}$ block $g_{i}^{j}$ , encrypts it with its own public key to obtain ${Enc}_{p k_{i}} (g_{i}^{j})$ , and uploads it to ${Agg}_{A}$ . For the selection of $g_{i}^{j}$ , ${Node}_{i}$ has the option to randomly select a subset of gradient blocks to transmit to ${Agg}_{A}$ , a method that promotes data diversity and enhances the model’s ability to generalize. In certain scenarios, ${Node}_{i}$ can instead use gradient projection to evaluate the quality of its data and decide which gradient blocks to share with ${Agg}_{A}$ based on this evaluation. This approach assists ${Agg}_{A}$ in more accurately selecting nodes for model aggregation, thereby enhancing the efficiency and precision of the aggregation process. Since each block contains a segment of the gradient information, the selected blocks can represent the update of the current round, provided that a suitable selection strategy and number of blocks are used. In this way, the objectives of federated learning are accomplished while user data privacy and security are safeguarded.

After receiving the encrypted gradient blocks, ${Agg}_{A}$ applies a random permutation to them, i.e., it completely scrambles their order by $Shuffle ({Enc}_{p k_{i}} (g_{i}^{j}))$ , and pairs the shuffled gradients with the model parameters shuffled in the same order, $Shuffle ({Enc}_{p k_{i}} (m_{i}^{j}))$ . ${Agg}_{A}$ sends the shuffled gradients and model parameters to ${Agg}_{B}$ , which uses the strong key to decrypt them and then updates the global model parameters. Because ${Agg}_{B}$ does not know the original order of the parameters, it cannot restore the order of the gradients uploaded by the clients and therefore cannot extract information from them. Afterward, ${Agg}_{B}$ encrypts the updated model parameters with the client’s public key and transmits them to ${Agg}_{A}$ , which restores them to the original order, packs them, and uploads them to the blockchain. The interaction flow between the user and ${Agg}_{A}$ and that between ${Agg}_{A}$ and ${Agg}_{B}$ are shown in Figs. 4 and 5, respectively.

Fig. 4.

Interaction between user and ${Agg}_{A}$ .

Fig. 5.

Interaction between ${Agg}_{A}$ and ${Agg}_{B}$ .
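The permutation logic of the dual-aggregator interaction can be sketched as follows. Encryption and decryption are deliberately omitted so that only the shuffle, update, and restore steps are visible; all function names are hypothetical, the gradients are assumed to be already averaged over the clients, and the fixed seed is for reproducibility only (a real ${Agg}_{A}$ would draw a secret permutation from a cryptographically secure source).

```python
import random
from typing import List, Sequence


def make_permutation(length: int, rng: random.Random) -> List[int]:
    """Agg_A: draw a secret random ordering of the positions."""
    perm = list(range(length))
    rng.shuffle(perm)
    return perm


def apply_permutation(values: Sequence[float], perm: Sequence[int]) -> List[float]:
    """Position k of the output holds values[perm[k]]."""
    return [values[src] for src in perm]


def invert_permutation(values: Sequence[float], perm: Sequence[int]) -> List[float]:
    """Agg_A: put shuffled values back into their original positions."""
    restored = [0.0] * len(values)
    for k, src in enumerate(perm):
        restored[src] = values[k]
    return restored


def agg_b_update(shuffled_params: Sequence[float],
                 shuffled_grads: Sequence[float],
                 alpha: float) -> List[float]:
    """Agg_B: element-wise update without knowing the original order."""
    return [p + alpha * g for p, g in zip(shuffled_params, shuffled_grads)]


rng = random.Random(0)  # demo seed only
params = [0.10, 0.20, 0.30, 0.40]
mean_grads = [1.0, -1.0, 0.5, 0.0]

perm = make_permutation(len(params), rng)
shuffled_result = agg_b_update(apply_permutation(params, perm),
                               apply_permutation(mean_grads, perm),
                               alpha=0.1)
updated = invert_permutation(shuffled_result, perm)
```

Because parameters and gradients are shuffled with the same permutation, the element-wise update stays correct while the party that performs it never sees the original positions.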

2.3. Aggregation verification

The update of each parameter in the model is essentially independent. The model parameters are updated as $θ_{t + 1} = θ_{t} + \frac{α}{n} (g_{1} + g_{2} + \dots + g_{n})$ , where θ is the model parameter, α is the learning rate, n is the number of users, and $g_{i}$ is the gradient of the parameter θ calculated by user i.

Our scheme uses Pedersen commitments, a cryptographic primitive in which the committing party chooses sensitive data m, computes the corresponding commitment c, and sends c to the verifier. When m is later opened, the verifier can check that c was indeed computed from m, while the committing party cannot substitute a different value for m. Pedersen commitments also have an additive homomorphism property, i.e., $COMM (\sum_{j} m_{j}) = \prod_{j} COMM (m_{j})$ .

Utilizing this property, our scheme designs a verifiable model aggregation method, in which ${Node}_{i}$ uploads its encrypted gradient ${Enc}_{p k_{i}} (g_{i}^{j})$ along with the commitment $COMM (g_{i}^{j})$ of the gradient. When ${Agg}_{B}$ finishes performing the aggregation operation, ${Agg}_{A}$ packages the result of the aggregation, $θ_{t + 1}$ , as well as the commitments of all gradients. All of them are packed into blocks and added to the blockchain. The aggregated result should satisfy the homomorphic property, i.e., $\prod_{i} COMM (g_{i}^{j}) = COMM (\sum_{i} g_{i}^{j})$ , and $\sum_{i} g_{i}^{j}$ can be calculated by $n (θ_{t + 1} - θ_{t}) / α$ . Thus, the user can verify whether ${Agg}_{B}$ has performed the aggregation correctly. The corresponding algorithm is shown in Algorithm 2.

Algorithm 2

Aggregation verification
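The verification check can be sketched with toy Pedersen commitments as follows. This is a hedged sketch, not the paper's Algorithm 2. Several points are assumptions that the text does not specify: the tiny group parameters (insecure, for readability), the fixed-point factor `SCALE` used to turn float gradients into integers, gradients that lie on that fixed-point grid, and the availability of the summed blinding factors `r_sum` to the verifier.

```python
import secrets

# Toy parameters (assumption): P = 2*Q + 1 with P and Q prime; G and H are
# squares mod P, so each generates the subgroup of prime order Q. A real
# deployment needs a large group and an H with unknown discrete log base G.
P, Q = 2039, 1019
G, H = 4, 9
SCALE = 100  # hypothetical fixed-point factor: gradient 0.12 -> integer 12


def quantize(x: float) -> int:
    """Map a float to an exponent in Z_Q (negative values wrap modulo Q)."""
    return round(x * SCALE) % Q


def commit(m: int, r: int) -> int:
    """Pedersen commitment COMM(m; r) = G^m * H^r mod P."""
    return (pow(G, m % Q, P) * pow(H, r % Q, P)) % P


def verify_aggregation(commitments, r_sum, theta_prev, theta_next, alpha):
    """Check prod_i COMM(g_i) == COMM(sum_i g_i), with the sum taken from theta."""
    n = len(commitments)
    grad_sum = quantize(n * (theta_next - theta_prev) / alpha)
    product = 1
    for c in commitments:
        product = (product * c) % P
    return product == commit(grad_sum, r_sum)


# Three nodes each commit to their gradient for one parameter.
gradients = [0.12, -0.05, 0.30]
blinders = [secrets.randbelow(Q) for _ in gradients]
commitments = [commit(quantize(g), r) for g, r in zip(gradients, blinders)]

alpha, theta_prev = 0.1, 0.5
theta_next = theta_prev + alpha / len(gradients) * sum(gradients)

honest_ok = verify_aggregation(commitments, sum(blinders),
                               theta_prev, theta_next, alpha)
tampered_ok = verify_aggregation(commitments, sum(blinders),
                                 theta_prev, theta_next + 0.01, alpha)
```

An honestly aggregated parameter passes the check, whereas a parameter that was shifted after aggregation no longer matches the product of the published commitments.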

3. Performance and evaluation

Previous solutions that combine federated learning with homomorphic encryption primarily used established generic homomorphic encryption techniques without adequately optimizing the computation overhead. This led to scalability issues in encrypted computation and communication during federated training, limiting their applicability in real-world scenarios. In our approach, as described in Section 2.2, participants divide their local gradients into N blocks and homomorphically encrypt only a portion of them. The aggregation node then decrypts these encrypted gradients to complete the global parameter update. This method reduces computational costs while protecting the privacy of participants’ gradients. Related studies [5,15] have analyzed privacy leaks and proposed “partial transparency”, such as hiding parts of the model to limit adversaries’ ability to successfully execute attacks like gradient inversion. Jin et al. [7] have also reduced computational overhead by encrypting only the most sensitive parameters while ensuring the privacy of gradients.

To verify the effectiveness of the proposed scheme, we conduct the following experiments.

3.1. Experimental environment

Our experiment uses the Household Electric Power Consumption dataset, which records the electricity consumption of one household over almost 4 years at a sampling interval of 1 min. It includes different electrical quantities and some sub-metering values. The data comprise the active and reactive power in that user’s home over time, as well as specific values for certain parts of the home, such as the kitchen, the laundry room, and the water heater. A total of 2,075,259 measurements were collected during the period 2006–2010. This experiment assumes that these 2,075,259 records are held by 100 different organizations, with no overlap among the data held by different organizations. These 100 organizations aim to jointly predict the electricity consumption of the customer at the next moment without disclosing any of the data they hold. Given that the data are time series, modeling is done using a long short-term memory (LSTM) network. The CPU used is an 11th-generation Core i7, and the machine has 16 GB of RAM. The experiment adopts the PyTorch deep learning framework to build the LSTM model, the SecretFlow framework to implement the model encryption and transmission, and the FISCO BCOS framework to execute the decentralized smart contract construction.
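One way to obtain the 100 non-overlapping data holdings is sketched below. The text does not state how the records are divided, so the contiguous, near-equal split is an assumption, and `partition_indices` is a hypothetical helper.

```python
from typing import List


def partition_indices(num_records: int, num_holders: int) -> List[range]:
    """Split record indices into disjoint contiguous shards of near-equal size."""
    base, extra = divmod(num_records, num_holders)
    shards, start = [], 0
    for holder in range(num_holders):
        # The first `extra` holders receive one additional record.
        size = base + (1 if holder < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


shards = partition_indices(2_075_259, 100)
```

Contiguous shards keep each holder's records in time order, which suits sequence models; a random assignment would satisfy the no-overlap requirement equally well.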

3.2. Experimental results

This experiment demonstrates the effectiveness of the scheme in terms of accuracy, communication volume, and training time.

Fig. 6.

Accuracy of our scheme.

Fig. 7.

Accuracy of centralized learning.

Figures 6 and 7 illustrate the gap between the predicted and actual values of active voltage over time, with the red line being the predicted value and the blue line being the actual value. Figure 6 shows the experimental results of our scheme, and Fig. 7 depicts the results of centralized training, i.e., all the data are pooled together for traditional machine learning training. Compared with centralized training, our scheme is slightly less accurate, but it still predicts the peaks and valleys well, which in turn can be very helpful for electricity companies.

Fig. 8.

Loss of our scheme.

Fig. 9.

Loss of centralized learning.

Fig. 10.

Train loss achieved by the different methods.

Fig. 11.

Validation loss achieved by the different methods.

Figures 8 and 9 show the convergence of the training process. Specifically, Fig. 8 demonstrates the convergence of our scheme, and Fig. 9 shows the convergence of centralized learning. Given that the training samples are dispersed across different devices, a certain loss of performance occurs in the aggregation process. Convergence is also slower, as expected, but this slowdown is considered acceptable.

Furthermore, to investigate the effects of various privacy-enhancing techniques on model performance, we compare our proposed method with federated learning without privacy measures (such as FedAvg) and with federated learning incorporating differential privacy (DP) [1]. The experimental outcomes are illustrated in Figs. 10 and 11. We present the training and validation loss achieved by the different approaches when training LSTM models in a federated learning setting with 100 clients and a 20% participation rate. For the DP method, the clipping threshold C is set to 4.0, and the privacy parameter δ is set to $10^{-5}$ . While the convergence of the three methods on the training dataset, as depicted in Fig. 10, shows little variation, there is a significant difference in model performance on the validation dataset (Fig. 11). We attribute this primarily to the partial gradient updating employed in our proposed scheme, which results in better generalization than the other methods.
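For readers unfamiliar with the DP baseline, the role of the clipping threshold C can be sketched as follows. This is a simplified illustration in the spirit of [1], not the configuration used in the experiments: the function names are hypothetical, and `noise_multiplier` is an assumed parameter, since the text reports only C and δ.

```python
import math
import random
from typing import List, Sequence


def clip_update(update: Sequence[float], clip_norm: float) -> List[float]:
    """Scale the update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [u * factor for u in update]


def add_gaussian_noise(update: Sequence[float], clip_norm: float,
                       noise_multiplier: float, rng: random.Random) -> List[float]:
    """Add N(0, (noise_multiplier * clip_norm)^2) noise to every coordinate."""
    sigma = noise_multiplier * clip_norm
    return [u + rng.gauss(0.0, sigma) for u in update]


C = 4.0  # clipping threshold from the experiment
clipped = clip_update([3.0, 4.0, 12.0], C)  # original L2 norm is 13
small = clip_update([0.3, 0.4], C)          # L2 norm 0.5, left unchanged
noisy = add_gaussian_noise(clipped, C, noise_multiplier=1.0,  # assumed value
                           rng=random.Random(0))
```

Clipping bounds the influence of any single update, and the noise is scaled to that bound; this added noise is one reason a DP baseline may generalize differently from the unprotected one.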

We further calculate the communication overhead. With 100 data holders participating in the training, the communication overhead for one round is 2.158 MB, compared with 16.6 MB in traditional federated learning, i.e., about 13% of the traditional volume. The selective gradient uploading approach described in Section 2.2 accounts for this saving by greatly reducing the number of parameters uploaded by each client.

4. Conclusion

We introduce a novel decentralized machine learning framework designed to protect the privacy of electric power user data. It employs a suite of technologies, such as smart contracts, consistent hashing, and homomorphic encryption, to achieve a secure and controllable training process without compromising privacy. We apply this framework to time series forecasting of electric power loads and validate the effectiveness of model training on a dataset. The results demonstrate that, compared with other existing distributed machine learning solutions, our approach achieves the desired performance, particularly in terms of communication overhead.

Acknowledgements

This work is supported by the Key Technologies for Sharing Value of Power Data Based on ‘Blockchain + Privacy Computing’ – Project 2: Research on Computing Technology for Sharing Power Data Based on Hybrid Privacy Computing (No. 52062623000C).

Conflict of interest

The authors have no conflict of interest to report.

References

1. Abadi, Chu, Goodfellow, H.B. McMahan, Mironov, Talwar and Zhang, Deep learning with differential privacy, in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318. doi:10.1145/2976749.2978318.

2. M.S. Abdalzaher, M.M. Fouda and M.I. Ibrahem, Data privacy preservation and security in smart metering systems, Energies 15(19) (2022), 7419. doi:10.3390/en15197419.

3. Acar, Aksu, A.S. Uluagac and Conti, A survey on homomorphic encryption schemes: Theory and implementation, ACM Computing Surveys (CSUR) 51(4) (2018), 1–35. doi:10.1145/3214303.

4. Cao, Liu, Meng and Sun, An overview on edge computing research, IEEE Access 8 (2020), 85714–85728. doi:10.1109/ACCESS.2020.2991734.

5. Hatamizadeh, Yin, H.R. Roth, Li, Kautz, Xu and Molchanov, GradViT: Gradient inversion of vision transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10021–10030.

6. Isaak and M.J. Hanna, User data privacy: Facebook, Cambridge Analytica, and privacy protection, Computer 51(8) (2018), 56–59. doi:10.1109/MC.2018.3191268.

7. Jin, Yao, Han, Joe-Wong, Ravi, Avestimehr and He, FedML-HE: An efficient homomorphic-encryption-based privacy-preserving federated learning system, 2023. arXiv preprint arXiv:2303.10837.

8. Karger, Lehman, Leighton, Panigrahy, Levine and Lewin, Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web, in: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997, pp. 654–663. doi:10.1145/258533.258660.

9. Khelifi, Luo, Nour, Sellami, Moungla, S.H. Ahmed and Guizani, Bringing deep learning at the edge of information-centric Internet of things, IEEE Communications Letters 23(1) (2018), 52–55. doi:10.1109/LCOMM.2018.2875978.

10. Kumar, A.A. Khan, Kumar, N.A. Golilarz, Zhang, Ting, Zheng, Wang et al., Blockchain-federated-learning and deep learning models for COVID-19 detection using CT imaging, IEEE Sensors Journal 21(14) (2021), 16301–16314. doi:10.1109/JSEN.2021.3076767.

11. Li, Fan, Tse and K.-Y. Lin, A review of applications in federated learning, Computers & Industrial Engineering 149 (2020), 106854. doi:10.1016/j.cie.2020.106854.

12. Lisin, Shuvalova, Volkova and Strielkowski, Sustainable development of regional power systems and the consumption of electric energy, Sustainability 10(4) (2018), 1111. doi:10.3390/su10041111.

13. Liu, Ding, Shaham, Rahayu, Farokhi and Lin, When machine learning meets privacy: A survey and outlook, ACM Computing Surveys (CSUR) 54(2) (2021), 1–36. doi:10.1145/3436729.

14. Liu, R.H. Deng, K.-K.R. Choo and Weng, An efficient privacy-preserving outsourced calculation toolkit with multiple keys, IEEE Transactions on Information Forensics and Security 11(11) (2016), 2401–2414. doi:10.1109/TIFS.2016.2573770.

15. Lu, X.S. Zhang, Zhao, He and Cheng, April: Finding the Achilles’ heel on privacy for vision transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10051–10060.

16. Qi, M.S. Hossain, Nie and Li, Privacy-preserving blockchain-based federated learning for traffic flow prediction, Future Generation Computer Systems 117 (2021), 328–337. doi:10.1016/j.future.2020.12.003.

17. Son, Yang and Na, Deep neural network and long short-term memory for electric power load forecasting, Applied Sciences 10(18) (2020), 6489. doi:10.3390/app10186489.

18. Veeramsetty, D.R. Chandra, Grimaccia and Mussetta, Short term electric power load forecasting using principal component analysis and recurrent neural networks, Forecasting 4(1) (2022), 149–164. doi:10.3390/forecast4010008.

19. Zhang, Xie, Bai, Yu, Li and Gao, A survey on federated learning, Knowledge-Based Systems 216 (2021), 106775. doi:10.1016/j.knosys.2021.106775.

20. Zhao, Chen, C.-Z. Gao, Li and Tan, Secure multi-party computation: Theory, practice and applications, Information Sciences 476 (2019), 357–372. doi:10.1016/j.ins.2018.10.024.

21. Zheng, Xie, H.-N. Dai, Chen, Weng and Imran, An overview on smart contracts: Challenges, advances and platforms, Future Generation Computer Systems 105 (2020), 475–491. doi:10.1016/j.future.2019.12.019.

22. Zheng, Xie, H.-N. Dai, Chen and Wang, Blockchain challenges and opportunities: A survey, International Journal of Web and Grid Services 14(4) (2018), 352–375. doi:10.1504/IJWGS.2018.095647.