Parallel power load abnormalities detection using fast density peak clustering with a hybrid canopy-K-means algorithm

Abstract

Anomalies in parallel power loads are processed with a fast density peak clustering technique that combines the strengths of the Canopy and K-means algorithms within Apache Mahout’s distributed machine-learning environment. The study uses Apache Hadoop’s data storage and processing tools, including HDFS and MapReduce, to manage and analyze big data. In the preprocessing phase, Canopy clustering, enabled by distributed machine learning through Apache Mahout, quickly produces an initial partitioning of the data points, which K-means then refines to improve clustering performance. The hybrid algorithm was implemented to minimise the time needed to handle the massive scale of the detected parallel power load abnormalities. Data vectors are generated based on the time needed, sequential and parallel candidate feature data are obtained, and the data rate is combined. After the time set is classified using Canopy with the K-means algorithm and the factor-weighted vector representation, the clustering effect is assessed using purity, precision, recall, and $F$ value. The results show that using Canopy as a preprocessing step reduces both the time needed to process the large number of power load abnormalities found in parallel with a fast density peak dataset and the running time of the K-means algorithm. Additionally, tests demonstrate that combining Canopy and K-means performs consistently and dependably on the Hadoop platform and yields clustering results that offer a scalable and effective solution for power system monitoring.

Keywords

Power load data abnormality detection and adjustment; hybrid (CKMA); K-means algorithm (KMA); canopy algorithm (CA); Apache Mahout

1. Introduction

Electric load data contain many outliers because of the inherent uncertainty in the numerous signals that make up the information of the electric power automation system [1]. The reliability of historical data is crucial to the success of power load characteristic analysis and load forecasting [2, 3, 4], so outliers in historical data must be handled properly. In [5], the fast decomposition orthogonal transformation state estimation algorithm is used to find and eliminate inaccurate readings. In [6], wavelet singularity detection is used to rectify and smooth the load readings. Furthermore, [7] applies the Kalman filter technique. Reference [8] proposes using a neural network for identification and adjustment to find and remove outliers from a large set of historical data, substituting predicted values where appropriate. A power load separation technique based on the entropy of the data is proposed in [9, 10].

The small dataset is the US Department of Energy’s residential electricity load data, published on a website with energy-related information [11]. It consists of power load statistics for 936 residential consumers, recorded every 60 minutes throughout the year, which yields 24 measurements per day. The large dataset consists of electricity data from a research Perspective Energy Experiment using Intelligent Meters [12]. It includes daily load curve information for more than 6,000 consumers from 2009 to 2010. Power consumption in the large dataset is more intricate and varied than in the small dataset and is the most important application of data clustering [13, 14, 15, 16]. The k-means algorithm is used to simplify communication in the analysis of power load data for both tiers of a multi-layered clustering model [17, 18, 19]; the results of local clustering of temporal data using adaptive k-means are subsequently imported into a larger model.

Reference [20] developed a method for real-time monitoring of subsynchronous control interactions in power systems using Improved Intrinsic Time Scale Decomposition. This method provides insight into complex system interferences, which is crucial for maintaining system stability and performance. Complementing this, reference [21] improved anomaly detection in autonomous vehicle technology using a denoising variational transformer, which is key for the interpretability and reliability of self-driving cars. Furthermore, an extensive review of potential sensor data anomalies in autonomous vehicles underlines the need for robust sensors to ensure vehicle safety and optimize efficiency [22, 23].

When weighted time-domain and frequency-domain data are combined, affinity propagation clustering is employed to produce the clustering results [24, 25, 26]. The literature [27] provides a related divide-and-conquer strategy that combines the adaptive k-means and density peaks clustering techniques. The literature [28] suggested a hybrid two-step strategy that creates several sub-clusters and then combines them using K-Medoids. However, data reduction and multi-level clustering alleviate the big data curse at the expense of the intricate structure of the raw time series data. In particular, substituting derived features for the original data reduces the interpretability of the patterns found and almost inevitably results in inaccurate grouping. Irrational data segmentation may also lead to these unwanted outcomes for multi-level methods, mainly in the first-level clustering, because each starting data block must always determine a large number of cluster centers, some of which may differ significantly from the actual centers.

This loss of global information affects the analysis of the final clustering results. It could therefore be more sensible to increase the parallelism of clustering algorithms so that they can accept the raw dataset directly as input. An improved algorithm has been adopted for cluster data mining and for detecting load characteristic curve abnormalities caused by inaccurate readings [29, 30]. In that work, the authors proposed a negative selection technique that generates a detector set to identify abnormalities rather than using the dataset directly. The results show that the algorithm outperforms the conventional approach in prediction accuracy, maintenance requirements, and convergence rate, and that it is more flexible and gives better results than prediction with neural networks. It is therefore necessary to extract data that can represent the main content of the dataset in order to obtain results within the required time. This paper retains power system monitoring data in three categories (consumer power consumption, weather data, and power generation) and uses them as candidate feature data after deduplication [31, 32]. The calculation of this vector needs to be improved. To verify the effectiveness of the Canopy $+$ K-means algorithm, this paper selects data sets, classifies them with both the Canopy with K-means algorithm and the plain K-means algorithm, and evaluates the clustering results of the two algorithms in terms of purity, precision, recall and $F$ value. The Hadoop framework supports big data processing and administration through its distributed file system, the Hadoop Distributed File System (HDFS). Additionally, Hadoop’s MapReduce is built to work well with HDFS by moving computation to the data rather than the other way around, which enables Hadoop to achieve high data throughput [33, 34].

1.1 The contributions of this paper are as follows

Hybrid system development: the paper presents a system that combines the Canopy and K-means algorithms to detect anomalous patterns in energy consumption.

Detector implementation: the paper provides a detector that precisely measures the degree to which power usage is anomalous.

Instead of using high-dimensional daily load curves, the paper describes how customers’ power usage varies from day to day using daily load characteristics.

The hybrid approach appears to be superior to previous detection algorithms in identifying instances of anomalous power consumption.

Deploying the hybrid can greatly decrease the time and materials needed to carry out assessments, resulting in cost savings, while the system improves accuracy and makes the best use of available resources.

Thus, the hybrid is more accurate and applicable across a wider range of power consumption conditions.

1.2 The paper is organized as follows

The organization of the remaining sections in this paper is visualized through the following flowchart [35]:

2. Materials and methods

2.1 Implementation process of K-means and canopy algorithms under the Hadoop platform

Hadoop’s scalable and reliable big data storage and processing infrastructure makes it ideal for managing massive data sets in a distributed setting [34]. For large-scale data analysis, Hadoop provides the underlying distributed storage and processing infrastructure, and Apache Mahout builds on it by offering scalable machine learning algorithms [36]. Because Mahout is compatible with Hadoop’s distributed computing features, machine learning methods can be applied to massive datasets, and Mahout exploits the parallelism and fault tolerance provided by Hadoop to execute them efficiently. While Apache Hadoop is primarily used for its distributed processing capabilities, Apache Mahout is an open-source library that provides a collection of machine learning algorithms and tools for tasks such as clustering, classification, and collaborative filtering. With these supplementary resources, Hadoop combined with Apache Mahout is a more compelling option for machine learning and data processing jobs than Hadoop alone. The capabilities that Apache Mahout adds to Hadoop are outlined in Table 1 [37, 38].

Table 1
The capabilities of Apache Mahout that are lacking in Hadoop

Facilities	Hadoop	Apache Mahout
Interface simplified for ML	Hadoop lacks a simplified ML interface	Apache Mahout provides a user-friendly ML interface
Rich ML algorithm library	Hadoop has limited ML algorithms	Apache Mahout offers a wide range of scalable ML algorithms
High-level abstractions	Hadoop requires low-level programming	Apache Mahout provides high-level abstractions for ease of use
Integration with ecosystem	Hadoop ecosystem integration is limited	Apache Mahout integrates well with other big data frameworks
Extensibility	Hadoop has limited extensibility	Apache Mahout allows for easy customization and extension

Dataset

The term “multi-level clustering techniques” refers to an approach used to analyze multi-level data structures. In the context of power consumption analysis, these methods can be applied to a sizable dataset from the Malaysian Electricity Department, specifically from the utility company Tenaga Nasional Berhad (TNB). Residential electricity load statistics, including energy-related information, are provided by TNB’s Department of Electricity. The dataset is significant because it contains the power load of 10,000 residential consumers, recorded at regular 60-minute intervals every day for a whole year [12], so precise power load statistics are available for each consumer on each day within the chosen timeframe. The dataset was gathered as part of a study that used smart meters, instruments that track and measure patterns in electricity consumption, to analyze energy consumption. In this instance, data from more than 10,000 users was gathered from 2015 through 2022 [13, 14, 15, 16]. For researchers examining home electricity usage trends over a prolonged period, this large dataset offers a wealth of data, and multi-level clustering algorithms allow them to study several levels of the data structure hierarchy and gain insight into the patterns and trends within it. Note that the dataset is particular to a research experiment and might not be available to the general public. Anyone interested in accessing and analyzing this material must normally go through the proper channels, such as working with the Malaysian Electricity Department or the Department of Electricity at TNB.

The K-means and Canopy algorithms can be implemented in the Hadoop environment with the Apache Mahout library. First, k clusters are created with Canopy; the K-means algorithm is then initialized with these k clusters to produce the desired clustering outcome. The main steps, depicted in the schematic of Fig. 1, are as follows:

Data preparation: the power load dataset is first converted to the Mahout vector format.

Canopy clustering: Canopy is used to find the initial clusters, and the results are saved in a file or folder.

K-means clustering: the K-means and Hybrid CKMA algorithms take their input from the results of the Canopy clustering stage. The output of the Hybrid algorithm is kept separate from the K-means output.

Density peak clustering: the K-means and Hybrid clustering results are used as input for the fast density peak clustering method, and the final results are written to a new directory.

Parallel processing: clustering is accelerated by dividing the task among several computers, or nodes.

Analysis and visualization: the output of the clustering procedures is examined and visualized so that outliers can be picked out.

The following schematic diagram is a simplified representation of the Apache Mahout architecture for identifying power load anomalies utilizing the Canopy, K-Means, and Hybrid algorithms.

Figure 1.

The process framework using Apache Mahout for detecting power load abnormalities based on Canopy with K-means and Hybrid algorithms.

2.2 Canopy algorithm

The Canopy algorithm is a fast approximate clustering technique. Its advantage is speed: the clusters can be obtained by traversing the data only once. However, the resulting clusters are coarse, so the method does not by itself give accurate clustering results [39]. The basic process of the Canopy algorithm is as follows [40]:

(1) Determine the two distance thresholds of the canopy, T1 and T2, where T1 $>$ T2.

(2) Take any data object from the dataset and calculate the distance between it and all Canopy centres.

(3) If no canopy exists yet, take the data object as a Canopy centre and delete it from the dataset. Otherwise, go to (4).

(4) If the distance from the data object to a Canopy centre is within T2, add it to that canopy and delete it from the dataset. Because the data object is so close to this canopy, it can no longer serve as the centre of another canopy.

(5) If the distance between the data object and a Canopy centre is within T1 but outside T2, the data object is also added to the canopy. However, it is not deleted from the dataset at this time and will participate in the next round of the clustering process.

(6) If the distance from the data object to all Canopy centres is beyond T1, it is regarded as a new Canopy centre and is deleted from the dataset.

(7) Repeat steps (2) to (6) until all data objects in the dataset have been assigned to a canopy.
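The Canopy procedure can be illustrated with a minimal single-machine sketch in Python. It is not the Mahout implementation: it follows the common formulation in which a point lying between T2 and T1 of a centre joins that canopy but remains a candidate centre, and the function names `canopy` and `euclidean` are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Return a list of (centre, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        centre = remaining.pop(0)           # next candidate becomes a centre
        members = [centre]
        still_candidates = []
        for p in remaining:
            d = distance(centre, p)
            if d <= t1:
                members.append(p)           # loosely bound: joins this canopy
            if d > t2:
                still_candidates.append(p)  # not tightly bound: stays a candidate
        remaining = still_candidates
        canopies.append((centre, members))
    return canopies


points = [(0, 0), (0.5, 0), (1.5, 0), (10, 10), (10.5, 10)]
for centre, members in canopy(points, t1=2.0, t2=1.0):
    print(centre, members)
```

Because canopies may overlap, a point such as (1.5, 0) can belong to one canopy and also seed another; only the centres are later needed to initialize K-means.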

2.3 K-means algorithm

The core idea of the K-means algorithm is to iteratively divide all data objects into k clusters so that objects in the same cluster have high similarity, while objects in different clusters have low similarity. The basic process of the K-means algorithm is as follows [41]:

(1) Input the dataset $D$ and the number $k$ of clusters to be divided.

(2) Select $k$ objects from $D$ arbitrarily as initial cluster centres.

(3) Calculate the distance from each object in the dataset to the centre of each cluster, and assign the object to the cluster with the closest centre.

(4) Recalculate the average of all data objects in each cluster as the new cluster centre.

(5) Repeat (3) and (4) until the cluster centres do not change or the maximum number of iterations is reached.
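A minimal pure-Python sketch of these K-means steps is given below for illustration. The function name `kmeans`, the `seed` parameter, and the choice to stop when the assignments no longer change (which implies that the centres no longer change) are assumptions of this sketch and are not taken from Mahout.

```python
import math
import random


def kmeans(data, k, max_iter=100, seed=0):
    """Cluster a list of tuples into k groups; returns (centres, assignment)."""
    rng = random.Random(seed)
    centres = rng.sample(data, k)  # arbitrary initial centres
    assignment = None
    for _ in range(max_iter):
        # Assign every object to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in data
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centres are stable
        assignment = new_assignment
        # Recompute each centre as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assignment


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(data, k=2)
print(centres, labels)
```

The random choice of initial centres is the weakness that Canopy seeding is meant to address.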

3. Proposed Hybrid (CKMA) algorithm methods

The proposed methodology begins with preprocessing the dataset. After power load abnormalities are identified in a parallel dataset, the data is written to the Hadoop Distributed File System (HDFS), converted to sequence files, and finally read from HDFS for processing and analysis. As a preparatory step, the Canopy algorithm is applied to the extracted data to select optimal source points. Then, after the data are represented as feature vectors, the K-means clustering algorithm uses the Canopy output as a guide to define the most informative clusters within the load dataset. The dataset’s feature weight is based on the vector execution model, which is used to compare parallel and sequential execution. The clustering process of Canopy with the K-means algorithm is shown in Fig. 2.

Figure 2.

Proposed clustering process using Canopy with K-means algorithm.

3.1 Problem formulation

Each customer’s load curve matrix is given by Eq. (1).

$\displaystyle L_{j}=[{L_{j}^{1},L_{j}^{2},\ldots,L_{j}^{T}}]$ (1)

Where $L_{j}^{d}({d=1,2,\ldots,T})=\{{L_{j,1}^{d},L_{j,2}^{d},\ldots,L_{j,m}^{d}}\}$ is the daily load curve of customer $j$ on day $d$, and $m$ is the number of samples in a daily load curve, which is set by the sampling frequency and could be 24, 48, 96, or more. A typical load curve is estimated from the customer’s daily average load, Eq. (2).

$\displaystyle L_{j}^{\prime}=\left.\left\{{\mathop{\sum}\limits_{d=1}^{T}L_{j,1}^{d},\mathop{\sum}\limits_{d=1}^{T}L_{j,2}^{d},\ldots,\mathop{\sum}\limits_{d=1}^{T}L_{j,m}^{d}}\right\}\right/T$ (2)

Five indicators can be used to describe the typical electrical usage behavior of every client, $\textit{DLC}_{j}=({a_{j,1},a_{j,2},\ldots,a_{j,5}})$ , as shown in Table 2. Here $T$ is the time interval used for the statistics (such as one month or one year), and weekday daily load curves are considered. The analysis therefore needs to account for the fact that customers’ power-use habits vary significantly between Monday and Friday.

Table 2

Characteristics of daily load

Periods time	Characteristics of daily load	Symbols descriptions	Meanings in the physical world
The time: 00:00–24:00, for all day	Loading rate	$a_{1}=P_{av}/P_{\max}$	Reflect changes in load throughout the whole day.
	The minimum coefficient of load	$a_{2}=P_{\min}/P_{\max}$	Throughout the whole day, reflect the range of load changes.
Peak load hours are 08:00–11:00 and 18:00–21:00.	Loading rate over peak load hours	$a_{3}=P_{av}^{\textit{peak}}/P_{\max}^{\textit{peak}}$	Reflect load fluctuation throughout the peak load time span.
Regular load hours are 06:00–08:00; 11:00–18:00 and 21:00–22:00.	Loading rate over stable load hours	$a_{4}=P_{av}^{\textit{stable}}/P_{\max}^{\textit{stable}}$	Reflect load fluctuation throughout the stable load time span.
Valley load hours are 22:00–24:00 and 00:00–06:00.	Loading rate over valley load hours	$a_{5}=P_{av}^{\textit{valley}}/P_{\max}^{\textit{valley}}$	Reflect load fluctuation throughout the valley load time span.
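As an illustration of Eq. (2) and the indicators of Table 2, the following Python sketch computes a typical load curve and the five daily load characteristics. It assumes hourly sampling ($m=24$), that sample $h$ covers the hour from $h$:00 to $h+1$:00, and strictly positive loads; the function and constant names are illustrative and do not appear in the original work.

```python
# Hour indices for each period of Table 2 (assumption: sample h covers h:00 to h+1:00).
PEAK = [8, 9, 10, 18, 19, 20]
STABLE = [6, 7] + list(range(11, 18)) + [21]
VALLEY = [22, 23] + list(range(0, 6))


def typical_load_curve(daily_curves):
    """Eq. (2): element-wise mean of T daily curves with m samples each."""
    T = len(daily_curves)
    return [sum(samples) / T for samples in zip(*daily_curves)]


def load_rate(values):
    """Average load divided by maximum load over the given samples."""
    return (sum(values) / len(values)) / max(values)


def daily_load_characteristics(curve):
    """Return (a1, ..., a5) of Table 2 for a 24-sample load curve."""
    a1 = load_rate(curve)                          # loading rate, whole day
    a2 = min(curve) / max(curve)                   # minimum load coefficient
    a3 = load_rate([curve[h] for h in PEAK])       # peak hours
    a4 = load_rate([curve[h] for h in STABLE])     # stable hours
    a5 = load_rate([curve[h] for h in VALLEY])     # valley hours
    return (a1, a2, a3, a4, a5)


# Two synthetic days for one customer (made-up values, not from the dataset).
day1 = [1.0] * 24
day2 = [3.0] * 24
print(daily_load_characteristics(typical_load_curve([day1, day2])))
```

Reducing each customer to five ratios keeps the feature vectors comparable across customers with very different absolute consumption.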

The k-means neighbourhood of object $p$ is defined in Eq. (3).

$\displaystyle N_{k}(p)=\{{o_{1}(p),o_{2}(p),o_{3}(p),\ldots,o_{K}(p)}\}$ (3)

Where $K=|{N_{k}(p)}|$ , $K\geqslant k$ , and element $p$ denotes the customer being detected. The local distribution matrix of element $p$ can be written as Eq. (4).

$\displaystyle M(p)=[{\textit{DLC}_{o_{1}(p)},\textit{DLC}_{o_{2}(p)},\ldots,\textit{DLC}_{o_{K}(p)}}]$ (4)

Here $M(p)$ is a $K\times 5$ matrix; building it involves determining which $k$ patterns of energy consumption most closely resemble element $p$ . The size of $k$ influences detection performance and accuracy. The following section of the research examines how the $k$ value affects detection performance for the various algorithmic techniques. Once the covariance matrix $CO({M(p)})$ of $M(p)$ has been calculated, its eigenvalue decomposition can be computed as Eq. (5).

$\displaystyle CO({M(p)})=V({M(p)})\times D({M(p)})\times V({M(p)})^{T}$ (5)

Where the columns of the $5\times 5$ matrix $V(M(p))$ are the eigenvectors of $CO(M(p))$ , and $D(M(p))$ is a $5\times 5$ diagonal matrix whose diagonal components are the eigenvalues $({\lambda_{p,1},\lambda_{p,2},\ldots,\lambda_{p,5}})$ of $CO(M(p))$ . The usual patterns of energy consumption of an object $p$ and its k-means neighbours can be depicted by building the matrix $[M(p),\textit{DLC}_{p}]$ . This matrix is projected into the exploratory factor space as Eq. (6).

$\displaystyle Y^{h}({M(p)})=[{M(p),\textit{DLC}_{p}}]\times V^{h}({M(p)})$ (6)

Where,

$V^{h}({M(p)})$ consists of the first $h$ columns of $V({M(p)})$ .

The columns $h=1,2,\ldots,5$ correspond to the $h$ most significant eigenvalues.

$\textit{DLC}_{p}$ represents the typical power usage characteristics of element $p$ .

The local distribution matrix is reconstructed as Eq. (7).

$\displaystyle R^{h}({M(p)})=Y^{h}({M(p)})\times V^{h}({M(p)})^{T}$ (7)

Where $R^{h}(M(p))$ denotes the local distribution matrix reconstructed from the first $h$ principal components. The local reconstruction error of element $p$ is then found using Eqs (8) and (9). In theory, the matrix reconstruction will be less accurate as $h$ decreases, since fewer principal components are used to calculate the reconstruction residual. To mitigate the impact of varying $h$ on the local reconstruction error, a new factor, $\gamma_{h}(p)$ , is introduced as a multiplier. The value of $\gamma_{h}(p)$ is the share of the sum of all eigenvalues that is contributed by the first $h$ principal components.

$\displaystyle\textit{err}(p)=\mathop{\sum}\limits_{h=1}^{5}||\textit{DLC}_{p}-r_{K+1}^{h}({M(p)})||\times\gamma_{h}(p)$ (8)

$\displaystyle\gamma_{h}(p)=\frac{\mathop{\sum}\nolimits_{s=1}^{h}\lambda_{p,s}}{\mathop{\sum}\nolimits_{s=1}^{5}\lambda_{p,s}}$ (9)
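The local reconstruction error can be sketched numerically with NumPy as follows. This is an illustrative reading of Eqs (5) to (9), not the authors' code: the function name is hypothetical, the projection is applied without mean-centring because the equations do not mention it, and the neighbours are assumed not to be all identical (otherwise the eigenvalue sum in Eq. (9) is zero).

```python
import numpy as np


def local_reconstruction_error(M, dlc_p):
    """err(p) of Eq. (8).

    M     : (K, 5) array with the DLC vectors of the K neighbours of p (K >= 2)
    dlc_p : (5,) array with the DLC vector of p
    """
    M = np.asarray(M, dtype=float)
    dlc_p = np.asarray(dlc_p, dtype=float)
    cov = np.cov(M, rowvar=False)             # CO(M(p)), 5 x 5
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    stacked = np.vstack([M, dlc_p])           # [M(p), DLC_p], (K+1) x 5
    total = eigvals.sum()
    err = 0.0
    for h in range(1, eigvals.size + 1):
        Vh = eigvecs[:, :h]                   # first h eigenvectors
        Y = stacked @ Vh                      # projection, Eq. (6)
        R = Y @ Vh.T                          # reconstruction, Eq. (7)
        residual = np.linalg.norm(dlc_p - R[-1])  # last row reconstructs DLC_p
        gamma = eigvals[:h].sum() / total     # weight, Eq. (9)
        err += residual * gamma
    return float(err)


# Synthetic check: neighbours on a line; one test point on it, one off it.
u = np.ones(5) / np.sqrt(5)
M = np.array([t * u for t in (1.0, 2.0, 3.0, 4.0)])
print(local_reconstruction_error(M, 2.5 * u))
print(local_reconstruction_error(M, 2.5 * u + np.array([1.0, -1.0, 0.0, 0.0, 0.0])))
```

A point that follows the dominant pattern of its neighbours is reconstructed almost exactly, whereas a point that deviates from it leaves a clear residual.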

Algorithm Implemented: Hybrid (CKMA) Algorithm to detect anomalies for energy consumption types.
Input:	The matrix representing each customer’s load curve $L_{j}$ ; detection threshold $\sigma$
Repeat
	calculate $L_{j}^{\prime}$ , Eq. (2)
	reduce $L_{j}^{\prime}\to\textit{DLC}_{j}=({a_{j,1},a_{j,2},\ldots,a_{j,5}})$ (NOTE: using KMA, CA, and CKMA)
	find the k-means neighbourhood $N_{k}(p)$ , Eq. (3)
	create the matrix of local distributions $M(p)$ using Eq. (4)
	estimate $CO({M(p)}),V({M(p)})$ , and $D({M(p)})$ , Eq. (5)
	Repeat
		project $[{M(p),\textit{DLC}_{p}}]\to Y^{h}({M(p)})$ , Eq. (6)
		reconstruct $Y^{h}({M(p)})\to R^{h}({M(p)})$ , Eq. (7)
	Until $h=5$
	calculate $\textit{err}(p)$ and $\textit{LOSCKMA}(p)$ , Eqs (8) to (10)
Until all of the data set’s items have been handled
Output:	If $\textit{LOSCKMA}(p)>\sigma$ , the abnormal label of $p\Leftarrow 1$
	Else, the abnormal label of $p\Leftarrow 0$

Each object’s local outlier score is calculated as Eq. (10). Here $\lambda_{p,s}$ denotes the $s$-th largest eigenvalue of the matrix $CO(M(p))$ and $r_{K+1}^{h}({M(p)})$ represents the ( $K+1$ )-th row of matrix $R^{h}({M(p)})$ . In the local outlier score calculation, the local reconstruction error of element $p$ is compared only to instances in its k-means neighbourhood. If $p$ is a normal instance, its local reconstruction error is lower than the local reconstruction errors of the other samples in its k-means neighbourhood. If $p$ is an anomalous sample, its local reconstruction error is larger than those of the other samples in its k-means neighbourhood. As a result, $\textit{LOSCKMA}(p)$ can show the variations between element $p$ and its neighbouring values.

$\displaystyle\textit{LOSCKMA}(p)=\frac{\mathop{\sum}\nolimits_{i=1}^{K}\frac{\textit{err}(p)\times\textit{dist}({p,o_{i}(p)})}{\textit{err}({o_{i}(p)})}}{K}$ (10)

Where $\textit{dist}({p,o_{i}(p)})$ is the distance between element $p$ and its k-means neighbour $o_{i}(p)$ . To establish a threshold, we consider element $p$ to have abnormal energy consumption patterns if $\textit{LOSCKMA}(p)>\sigma$ [30]. Energy providers can set the threshold based on Tenaga Nasional Berhad (TNB) estimates. The Hybrid (CKMA) Algorithm listing describes the precise procedure.
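The scoring and thresholding step can be sketched as below. The sketch reads the denominator of Eq. (10) as the neighbour's own reconstruction error $\textit{err}(o_{i}(p))$ , which is an assumption based on the surrounding description; the function names and the numeric values are hypothetical.

```python
def losckma(err_p, neighbour_errs, neighbour_dists):
    """Local outlier score of Eq. (10), under the stated assumption.

    err_p           : reconstruction error of p
    neighbour_errs  : err(o_i(p)) for each of the K neighbours (all > 0)
    neighbour_dists : dist(p, o_i(p)) for each of the K neighbours
    """
    K = len(neighbour_errs)
    ratios = (err_p * d / e for e, d in zip(neighbour_errs, neighbour_dists))
    return sum(ratios) / K


def abnormal_label(score, sigma):
    """Return 1 if the score exceeds the threshold sigma, otherwise 0."""
    return 1 if score > sigma else 0


# Made-up numbers for illustration only.
score = losckma(2.0, [1.0, 2.0], [1.0, 3.0])
print(score, abnormal_label(score, sigma=2.0))
```

The threshold $\sigma$ is left as an input because, as noted in the text, it is set by the energy provider.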

3.2 Canopy with K-means algorithm

In general, the proposed execution model uses the hybrid Canopy-K-Means Algorithm (CKMA) within the Hadoop platform to process power consumption data efficiently. Initially, the power load data is formatted for compatibility with Mahout, followed by a rapid initial clustering using the Canopy technique to establish preliminary cluster centers. These centers inform the subsequent K-Means clustering, which refines the clusters for more precise grouping. Further enhancement is achieved through fast density peak clustering, which applies the insights gained from the previous steps to better isolate anomalies. This multi-stage clustering process is performed in parallel, using the distributed computing power of Hadoop to manage and analyze the large-scale data effectively. The final clustering results are then visualized, providing a clear depiction of normal versus anomalous consumption patterns. Throughout this process, performance is evaluated using purity, precision, recall, and $F$ -value to assess the model’s accuracy and reliability in identifying discrepancies in power usage, thereby supporting the development of a robust, efficient, and eco-friendly smart grid system. To solve the problem that the K-means algorithm cannot pre-determine the number of clusters and selects the initial cluster center points randomly, this paper applies the Canopy algorithm to the dataset of power load abnormalities detected in parallel. After the $k$ value is obtained, the approximate center position of each canopy selected by the Canopy algorithm is used as an initial center point of K-means to improve the clustering effect of the K-means algorithm. When performing Canopy clustering, T1 $>$ T2 is set, and the values of T1 and T2 are related to the average execution model of the dataset. In addition, to prevent the algorithm from falling into a local optimum because the selected canopy center points are too dense, the sample closest to the center of all sample points is selected as the first Canopy center. The time required for the sequential and parallel clustering processes based on Canopy with K-means is shown in Fig. 3.
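The seeding idea described above can be sketched on a single machine as follows. This is an illustration rather than the Mahout or MapReduce implementation: the names `canopy_seeds` and `ckma` are hypothetical, each seed is taken as the mean of the points within T1 of a canopy centre (one possible reading of "approximate center position"), and the first canopy centre is the sample nearest the mean of all samples.

```python
import math


def canopy_seeds(points, t1, t2):
    """Return one seed per canopy; the number of seeds gives k."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    mean = tuple(sum(col) / len(points) for col in zip(*points))
    first = min(points, key=lambda p: math.dist(p, mean))
    remaining = [first] + [p for p in points if p is not first]
    seeds = []
    while remaining:
        centre = remaining.pop(0)
        members = [p for p in points if math.dist(p, centre) <= t1]
        seeds.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Points within t2 of this centre can no longer seed a canopy.
        remaining = [p for p in remaining if math.dist(p, centre) > t2]
    return seeds


def ckma(points, t1, t2, max_iter=100):
    """K-means initialised with canopy seeds; returns (centres, labels)."""
    centres = canopy_seeds(points, t1, t2)
    k = len(centres)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            if members:
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(ckma(points, t1=4.0, t2=2.0))
```

With well-chosen thresholds, the seeds already lie near the final centres, so K-means needs few iterations, which is the source of the time saving discussed in this section.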

Figure 3.

Big data clustering process based on canopy $+$ K-means.

3.3 Evaluation of dataset clustering results

The clustering algorithm can be evaluated using internal, external, and relative validity evaluation criteria [42]. The clustering outcomes are evaluated in this article primarily using four external assessment criteria [43, 44].

Purity: It is an easy-to-understand evaluation indicator. Each cluster is assigned to the document category that occurs most frequently in it, and the number of correctly assigned documents is divided by the total number of documents $N$ , as in Eq. (11).

$\displaystyle\textit{purity}({{\Omega},C})=\frac{1}{N}\mathop{\sum}\limits_{k}\max_{j}|{W_{k}\mathop{\cap}\nolimits C_{j}}|$ (11)

Where ${\Omega}=\{{W_{1},W_{2},\ldots,W_{K}}\}$ is the result of clustering, ${C}=\{{C_{1},C_{2},\ldots,C_{J}}\}$ is the category set, and ${W}_{k}({k=1,2,\ldots,K})$ and $C_{j}({j=1,2,\ldots,J})$ are collections of data from the dataset.

Precision: It measures the proportion of objects of a particular category in each cluster calculated by Eq. (12).

$\displaystyle P=\frac{TP}{TP+FP}$ (12)

Where TP (true positive) refers to the decision to correctly classify two similar data into the same cluster, and FP (false positive) refers to the decision to wrongly classify two dissimilar data into the same cluster.

Recall: It measures the degree to which each cluster contains all objects of a particular category and is calculated by Eq. (13).

$\displaystyle R=\frac{TP}{TP+FN}$ (13)

Here, FN (false negative) refers to the decision to classify two similar data into different clusters.

$F$ -Value: It is a clustering evaluation index that combines precision and recall and is calculated by Eq. (14).

$\displaystyle F=\frac{({1+\beta^{2}})PR}{\beta^{2}P+R}$ (14)

Here, $\beta$ is the harmonic coefficient, with $\beta\geqslant 1$ ; $\beta=1$ weights precision and recall equally.
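The four external criteria can be computed from a list of cluster labels and a list of true category labels, as in the following Python sketch. TP, FP and FN are counted over pairs of objects, in line with the pairwise definitions above; the function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, classes):
    """Eq. (11): share of objects that fall in the majority class of their cluster."""
    by_cluster = {}
    for k, c in zip(clusters, classes):
        by_cluster.setdefault(k, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)


def pairwise_prf(clusters, classes, beta=1.0):
    """Eqs (12) to (14) over all pairs of objects; returns (P, R, F)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f


clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes), pairwise_prf(clusters, classes))
```

In this toy case one object of category b is placed in the cluster dominated by category a, which lowers all four values below 1.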

3.4 Hybrid clustering and evaluation in power load analysis

In the practical clustering project described, real power consumption data, inclusive of metrics such as peak and off-peak usage, is standardized and uploaded to the Hadoop Distributed File System (HDFS). Utilizing Apache Mahout, the Canopy algorithm swiftly determines initial cluster centers, exploiting Hadoop’s distributed processing for data scalability. These centers seed the K-Means algorithm within Mahout, which iteratively refines clusters. Evaluating these clusters with purity and $F$ -Value metrics not only underscores the utility of hybrid clustering in big data contexts but also illustrates a concrete method for anomaly detection in power usage, effectively bridging the gap between theoretical models and real-world applications. To exemplify clustering with a hybrid of K-Means and Canopy algorithms in Hadoop via Apache Mahout, we begin with data preparation, using the dataset from the Malaysian Electricity Department. Post vector format conversion for Mahout compatibility, the Canopy technique identifies preliminary clusters, simplifying the task for K-Means, which then refines these clusters. Through parallel processing in Hadoop, the clustering operation is expedited across multiple nodes. Post-clustering, purity and $F$ -Value metrics are used to assess clustering quality and to pinpoint power consumption outliers or anomalies.

The following example illustrates how these evaluation metrics are applied in practice.

The plot shows 8 observations that have been clustered into 3 groups. Each observation is represented by an ‘x’ and is colored according to the cluster it has been assigned to, with the color bar on the right indicating the cluster numbers.

This clustering could have been the result of a hybrid clustering approach using K-Means initialized by Canopy cluster centers. In an actual application, the observations would represent data points with features extracted from the power load data, and the clusters could represent different typical load profiles or potential anomalies.

The plotted data is generated randomly for this example, and the clustering is performed using the K-Means algorithm. In practice, the Canopy algorithm would first be used to quickly generate rough clusters that serve as initial centroids for the K-Means algorithm, which then refines the clustering. This two-step process is particularly useful for large datasets as it can significantly reduce the time K-Means would otherwise take to converge on large datasets.

The visualization aids in interpreting the clustering results, where the proximity of points and their colors represent the grouping determined by the algorithm. In the context of power load analysis, such clustering might help in identifying patterns of usage that correspond to normal behavior or various types of anomalies or inefficiencies.

Given the plot details above, we create a hypothetical example with 8 observations and 3 clusters.

In the provided clustering plot, outliers or abnormal data points are typically those that lie a significant distance away from the clusters of other points. Outliers can be identified as points that do not group well with any cluster or are far away from their cluster centers.

Based on the figure, there do not appear to be any significant outliers. All points seem to be relatively close to others within the same color group, indicating they are close to their respective cluster centroids. However, the point in the cyan color at the bottom left (with the lowest values on both Feature1 and Feature2) could potentially be considered an outlier within its cluster due to its distance from the other points in the same cluster.

It’s important to note that outlier detection depends on the context and specific criteria used to define what is considered “normal” within the dataset. In a power load dataset, for instance, what constitutes an outlier would depend on typical load profiles, the expected variability in power usage, and other domain-specific factors. In statistical terms, a point may be considered an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile of the dataset.
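The 1.5 × IQR rule mentioned above can be written as a short function. The sketch below is illustrative only: the function name `iqr_outliers` and the sample load values are hypothetical, and quartiles are computed with NumPy's default linear interpolation.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical load readings: seven ordinary values and one spike
loads = [10, 11, 12, 11, 10, 12, 11, 40]
mask = iqr_outliers(loads)
print([v for v, flagged in zip(loads, mask) if flagged])
```

For a real power load dataset the multiplier `k` and the grouping of readings (per customer, per hour of day, and so on) would have to be chosen from domain knowledge, as the paragraph above notes.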

4. Experimental set-up and results

4.1 Experimental setup

The experiments provide an overview of a setup for parallel data mining. They also demonstrate an intelligent cloud computing framework for smart monitoring of power systems that uses a hybrid of Canopy clustering and K-Means clustering with Apache Mahout.

For dataset preparation, we obtained a power load dataset and preprocessed it into a compatible format by normalizing it. For the Hadoop cluster configuration, we set up a Hadoop cluster sized to handle the dataset and the computational requirements. For the Apache Mahout installation, we installed and configured Apache Mahout on the Hadoop cluster and implemented the fast density peak clustering algorithms using Mahout’s MapReduce framework. We also evaluated how the performance of the algorithms for detecting power load abnormalities depended on parameters such as the number of clusters and the distance thresholds. The input data was then formatted as a Hadoop SequenceFile, and the MapReduce job for the algorithms was submitted to the Hadoop cluster. For the performance evaluation, we retrieved the clustered results generated by the algorithms from the output paths specified in the MapReduce job and evaluated the algorithms in terms of clustering quality and detection accuracy metrics. Finally, we assessed and compared the results to gain insight into power load abnormalities.
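The normalization step in the dataset preparation can be illustrated with a z-score standardization. This is one common choice and is an assumption here, since the text does not state which normalization was applied; the function name `standardize` and the sample matrix are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # leave constant features at zero instead of dividing by 0
    return (X - mean) / std

# Hypothetical rows: (peak usage, off-peak usage, constant meter flag)
raw = [[1.0, 10.0, 5.0],
       [2.0, 20.0, 5.0],
       [3.0, 30.0, 5.0]]
print(standardize(raw))
```

Putting all features on a comparable scale matters for Canopy and K-means alike, because both rely on distances between feature vectors.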

4.2 Experiments and discussion

K-means is a well-known partitioning-based technique, valued for its simplicity and effectiveness in solving clustering problems on unlabeled data (i.e., data without defined categories or groups), as shown in Fig. 4 [45, 46].

Figure 4.

Evaluation of the clustering effect of the Canopy, K-means, and Canopy $+$ K-means algorithms on data classification HDFS.

Two models, sequential and parallel, are used to compare the clustering time of the Canopy and K-means algorithms on big data, as follows:

Sequential Model: The k-means algorithm is used in serial mode [47, 48], the first implementation mode. A single machine performs every step of the k-means algorithm in this mode.

Parallel Model: A parallel algorithm can be either control parallel or data parallel. Data parallelism works by breaking down a large dataset into smaller ones, each receiving the same processing. It can run on a single machine with multiple threads or be distributed across a network of computers (nodes). Control parallelism performs a different set of calculations for each small dataset. The time required for Canopy and K-means clustering to complete the clustering process is measured in these studies. Several of the nodes we use in our tests have limits that we have established. When an algorithm runs sequentially, it is treated as though the processors of a single node were processing it. For parallel processing, the number of available processors is capped by the total number of available nodes. Dataset clustering times and details for the sequential and parallel implementations with three cluster groups are shown in Fig. 5.
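The data-parallel idea can be sketched as one K-means iteration split into a map step (each chunk computes per-cluster sums and counts) and a reduce step (the partial results are combined into new centers). This is a hedged illustration only: threads on one machine stand in for Hadoop nodes, and the names `map_partial_sums` and `parallel_kmeans_step` are hypothetical rather than part of Mahout.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def map_partial_sums(chunk, centers):
    """Map step: per-cluster coordinate sums and point counts for one data chunk."""
    dist = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    sums = np.zeros_like(centers)
    counts = np.zeros(len(centers), dtype=int)
    np.add.at(sums, labels, chunk)     # accumulate points into their cluster's sum
    np.add.at(counts, labels, 1)
    return sums, counts

def parallel_kmeans_step(X, centers, n_workers=4):
    """One K-means update: the same map function runs on every chunk, then a reduce."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    chunks = np.array_split(X, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: map_partial_sums(c, centers), chunks))
    sums = sum(p[0] for p in partials)       # reduce step
    counts = sum(p[1] for p in partials)
    new_centers = centers.copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers

X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
print(parallel_kmeans_step(X, [[1.0, 1.0], [9.0, 1.0]], n_workers=2))
```

Running the step with one worker gives the sequential model; the result is the same, and only the way the work is divided changes.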

Canopy takes approximately 0.36803 seconds in sequential mode and 0.36118 seconds in parallel mode on multiple nodes. K-means without Canopy takes around 0.0099415 seconds in sequential mode and 0.0083332 seconds in parallel mode on multiple nodes. Canopy with K-means takes about 0.0081674 seconds in sequential mode and 0.0073397 seconds in parallel mode. These results give the time and node requirements for implementing parallel K-means via Hadoop. They indicate that the time needed to complete the K-means technique can be reduced by using Canopy clustering as a preliminary step. The K-means algorithm in parallel mode therefore works well with the Canopy method, and the outcomes depend solely on the size of the Hadoop cluster.
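To make the comparison concrete, the reported figures can be turned into speed-up ratios. The sketch below assumes that each pair of numbers above is a (sequential, parallel) run time in seconds, which is a reading of the text rather than something it states as a formula; the helper name `speedup` is hypothetical.

```python
# (sequential, parallel) run times in seconds, as read from the paragraph above
timings = {
    "Canopy": (0.36803, 0.36118),
    "K-means": (0.0099415, 0.0083332),
    "Canopy + K-means": (0.0081674, 0.0073397),
}

def speedup(sequential, parallel):
    """Ratio of sequential to parallel run time; values above 1 mean parallel is faster."""
    return sequential / parallel

for name, (seq, par) in timings.items():
    saved = 100.0 * (1.0 - par / seq)
    print(f"{name}: speed-up {speedup(seq, par):.3f}, time saved {saved:.1f}%")
```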

Figure 5.

Time comparisons between running the Sequential & Parallel of the Canopy, K-Means and Canopy with k-means algorithms.

4.3 Evaluation metrics

The evaluation metrics used to assess the performance of the clustering algorithms are purity, precision, recall, and $F$-value. These metrics measure the accuracy of the clustering results by comparing the labels of the generated clusters with the ground truth labels. Purity measures the proportion of data points that are correctly classified into their true cluster. Precision measures the proportion of true positives among the data points assigned to a cluster. Recall measures the proportion of true positives identified among all the actual data points belonging to a cluster. The $F$-value is the harmonic mean of precision and recall, which offers a balanced measurement of both. For $k$ values ranging from 30 to 120, we assessed the effectiveness of the Canopy, K-Means, and hybrid Canopy $+$ K-Means clustering algorithms. Table 3 displays the evaluation findings and includes the purity, precision, recall, and $F$-value for each algorithm at each $k$ value. The assessment findings show that in terms of purity, precision, recall, and $F$-value, the hybrid Canopy $+$ K-Means algorithm generally surpassed the Canopy and K-Means algorithms individually. The accuracy and significance of the clusters produced were enhanced thanks to the hybrid algorithm’s capacity to integrate the advantages of the two techniques. As evidenced by the greater purity, precision, recall, and $F$-value for bigger $k$ values, the evaluation findings further suggest that adding more clusters generally enhances clustering performance.

The assessment criteria employed in this research have shown that the hybrid approach and the Canopy and K-Means algorithms are effective for clustering the dataset on the power system. The hybrid Canopy $+$ K-Means algorithm has proven to perform better in terms of precision and the usefulness of the clusters that are produced. The evaluation findings offer insightful information about the algorithms’ clustering capabilities and can be used to direct the selection of suitable clustering techniques for similar datasets.
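The four metrics can be computed directly from true labels and cluster assignments. The sketch below follows one common convention (purity as the share of points that belong to the majority class of their cluster, precision and recall for a chosen cluster and class pair); the exact variant used for Table 3 is not restated here, so the function names and the toy labels are assumptions.

```python
from collections import Counter

def purity(labels_true, labels_pred):
    """Share of points that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)

def precision_recall_f(labels_true, labels_pred, cluster, cls):
    """Precision, recall and F-value of one cluster with respect to one true class."""
    pairs = list(zip(labels_true, labels_pred))
    tp = sum(1 for t, p in pairs if p == cluster and t == cls)
    in_cluster = sum(1 for _, p in pairs if p == cluster)
    in_class = sum(1 for t, _ in pairs if t == cls)
    precision = tp / in_cluster if in_cluster else 0.0
    recall = tp / in_class if in_class else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0   # harmonic mean
    return precision, recall, f_value

# Toy example: 8 observations, 3 true classes, 3 clusters
truth = ["a", "a", "a", "b", "b", "b", "c", "c"]
found = [0, 0, 1, 1, 1, 1, 2, 2]
print(purity(truth, found))
print(precision_recall_f(truth, found, cluster=1, cls="b"))
```

In this toy case one point of class "a" falls into cluster 1, so purity is 7/8, while cluster 1 has precision 3/4 and recall 1 for class "b".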

Table 3
Hybrid approach, K-means, and Canopy algorithms evaluation metrics for each K range from 30 to 120

Algorithm	Purity	Precision	Recall	$F$-Value
$K=$ 120
Canopy	0.74848	0.84069	0.83915	0.8466
K-means	0.85743	0.87575	0.89693	0.88833
Hybrid	0.91419	0.99647	0.92741	0.94978
$K=$ 90
Canopy	0.67725	0.84069	0.83915	0.76222
K-means	0.77779	0.89501	0.83918	0.70597
Hybrid	0.91419	0.93844	0.91818	0.90346
$K=$ 60
Canopy	0.82719	0.82404	0.84759	0.81748
K-means	0.85514	0.88338	0.86701	0.80745
Hybrid	0.89609	0.9171	0.90904	0.90346
$K=$ 50
Canopy	0.96106	0.99229	0.98475	0.9787
K-means	0.9689	0.98832	0.90538	0.98964
Hybrid	0.97072	0.99229	0.95565	0.98854
$K=$ 40
Canopy	0.82048	0.88716	0.8287	0.51201
K-means	0.82114	0.86612	0.81178	0.86689
Hybrid	0.92509	0.98552	0.92506	0.90321
$K=$ 30
Canopy	0.87353	1.00	0.87915	0.88622
K-means	0.85074	0.83366	0.79635	0.80176
Hybrid	0.9849	1.00	0.93351	0.91321

4.3.1 Precision versus recall larger dataset

The hybrid Canopy with K-Means algorithm exceeds the standalone Canopy and K-Means algorithms with regard to precision and recall. Furthermore, the Precision-Recall (PR) curves of the three algorithms follow a consistent pattern of rises and falls. Raising the detection threshold increases the number of outliers the algorithms can pick up without compromising precision. In order to catch more out-of-the-ordinary consumption, one could lower the judgement threshold, although doing so would reduce detection accuracy. We compare the performance of the three approaches at the break-even point, where precision and recall are equal, by analysing the crossing points of their respective PR curves on the large dataset. Finding trends in energy consumption becomes more challenging as the dataset grows larger and more users with varying energy habits are included. The PR curves of the three algorithms are compared for large datasets in Fig. 6a–c. Area Under the Curve (AUC) values show that the hybrid Canopy $+$ K-Means-based detection method has better detection performance than either Canopy or K-Means alone. The hybrid Canopy $+$ K-Means-based detection algorithm has AUCs of 0.9171 at $k=$ 60, 0.93844 at $k=$ 90, and 0.99647 at $k=$ 120. The K-Means AUCs are 0.88338 at $k=$ 60, 0.89501 at $k=$ 90, and 0.87575 at $k=$ 120. The Canopy AUCs are 0.82404 at $k=$ 60, 0.84069 at $k=$ 90, and 0.84069 at $k=$ 120.
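A PR curve, its area, and the point where precision and recall are closest can be computed from anomaly scores as sketched below. This is an illustrative construction under stated assumptions: each sample is assumed to carry a numeric anomaly score (higher means more abnormal), ties between scores are ignored, the curve is started at recall 0 with precision 1 by convention, and all function names and sample values are hypothetical.

```python
import numpy as np

def pr_curve(scores, is_anomaly):
    """Precision and recall as the detection threshold sweeps from high to low scores."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(is_anomaly, dtype=bool)
    y = y[np.argsort(-scores, kind="stable")]      # most suspicious samples first
    tp = np.cumsum(y)
    fp = np.cumsum(~y)
    precision = np.concatenate([[1.0], tp / (tp + fp)])
    recall = np.concatenate([[0.0], tp / y.sum()])
    return precision, recall

def area_under(x, y):
    """Trapezoidal area under the curve y(x)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def break_even(precision, recall):
    """Index of the point where precision and recall are closest to each other."""
    return int(np.argmin(np.abs(precision - recall)))

scores = [0.9, 0.8, 0.7, 0.6]          # hypothetical anomaly scores
truth = [True, True, False, False]     # hypothetical ground truth
precision, recall = pr_curve(scores, truth)
i = break_even(precision, recall)
print(area_under(recall, precision), precision[i], recall[i])
```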

Figure 6.

Comparison PR between Canopy, Hybrid and K-means using the large datasets.

4.3.2 Precision versus recall small dataset

Figure 7a–c compares the PR curves of the three detection techniques for small datasets. Higher Area Under the Curve (AUC) values indicate that the hybrid Canopy and K-Means-based detection method matches or outperforms the individual Canopy and K-Means techniques. The hybrid Canopy and K-Means algorithm has an AUC of 1.00 at $k=$ 30, 0.98552 at $k=$ 40, and 0.99229 at $k=$ 50. The K-Means AUCs are 0.83366 at $k=$ 30, 0.86612 at $k=$ 40, and 0.98832 at $k=$ 50. The Canopy AUCs are 1.00 at $k=$ 30, 0.88716 at $k=$ 40, and 0.99229 at $k=$ 50. In terms of detection performance, the Hybrid detection method is therefore at least as good as the Canopy and K-Means algorithms for every value of $k$ tested, with Canopy matching it at $k=$ 30 and $k=$ 50. The points on the curves move as the detection threshold rises. As the detection threshold rises, the Canopy and K-Means algorithms both exhibit a precipitous drop in the PR curve. While Canopy and K-Means use different detection strategies, the suggested method is able to discover observations that are out of line with the features of the data distribution.

Figure 7.

Comparison PR between Canopy, Hybrid and K-means using the small datasets.

Figure 8.

Canopy, Hybrid and K-means based on ROC curves with different $k$ examination in a larger dataset.

4.3.3 True positive rate versus false positive rate: larger datasets

By contrasting the true positive rate with the false positive rate, we were able to evaluate the performance of the clustering techniques. The detection method’s efficiency is highly sensitive to the choice of $k$. Figure 8 illustrates the ROC curves for different $k$ values for the Hybrid technique, Canopy, and the K-Means detection algorithms, and Table 4 displays the relevant Area Under the Curve (AUC) values. The ROC curves and AUCs of the proposed method on the small dataset are relatively invariant across different values of $k$. The True Positive Rate (TPR) is the fraction of actual positive samples that are detected, and the False Positive Rate (FPR) is the fraction of actual negative samples that are wrongly flagged. Both the Canopy and K-Means detection methods have significantly reduced accuracy when $k$ is small. Results show that the value of $k$ affects both algorithms more heavily than any other parameter.
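The ROC curve and its AUC follow from the usual definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), evaluated as the detection threshold is lowered. The sketch below is a minimal illustration, assuming each sample has a numeric anomaly score and ignoring tied scores; the function names and the four sample points are hypothetical.

```python
import numpy as np

def roc_curve(scores, is_anomaly):
    """FPR and TPR as the detection threshold sweeps from high to low scores."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(is_anomaly, dtype=bool)
    y = y[np.argsort(-scores, kind="stable")]               # most suspicious first
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])        # TP / (TP + FN)
    fpr = np.concatenate([[0.0], np.cumsum(~y) / (~y).sum()])    # FP / (FP + TN)
    return fpr, tpr

def area_under(x, y):
    """Trapezoidal area under the curve y(x)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

scores = [0.9, 0.8, 0.7, 0.6]          # hypothetical anomaly scores
truth = [True, False, True, False]     # hypothetical ground truth
fpr, tpr = roc_curve(scores, truth)
print(area_under(fpr, tpr))
```

Here three of the four (anomaly, normal) pairs are ranked in the right order, which is why the area comes out at 0.75; an AUC of 1.0 would mean every anomaly scores above every normal sample.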

The outcomes indicate that the Hybrid algorithm outperforms the Canopy and K-Means methods in detecting aberrant electricity usage patterns, even for large datasets. The ROC curves show that the Hybrid algorithm has improved detection performance due to greater TPR values at lower FPR levels. In addition, the optimum points of the Hybrid and Canopy algorithms are relatively close to one another, unlike the optimum point of the K-Means method. The Hybrid and Canopy approaches have a greater recall value at the optimum point than the K-Means algorithm, indicating that they are more accurate at detecting true positives. It is important to note, though, that the proposed approach may have more difficulty than the other two algorithms with consumers who use electricity in unconventional ways. This indicates that the proposed method may require further development or customization for certain applications, such as identifying anomalous electricity usage patterns for customers with specific energy consumption habits. Overall, the Hybrid technique appears to have potential for improving the precision and efficiency of detecting unusual patterns in electrical consumption. Additional research in this area may have significant benefits for the energy industry and its customers. Fig. 8a–c shows the TPR against FPR curves, including the equilibrium point of each ROC curve, for the Hybrid technique compared with the Canopy and K-Means methods at $k=$ 60, $k=$ 90, and $k=$ 120.

Table 4
Large dataset, the AUC of Canopy, Hybrid and K-means with different $k$ values tested

K	AUC
	Canopy	K-means	Hybrid
60	0.82404	0.88338	0.9171
90	0.84069	0.89501	0.93844
120	0.84069	0.87575	0.99647

Figure 9.

CKMA, CA and KMA based on ROC curves with different $k$ examination in a smaller dataset.

4.3.4 True positive rate versus false positive rate: small datasets

Figure 9a–c shows that the TPR is greater than the FPR at the optimum point of the ROC curve when $K=$ 30, $K=$ 40, and $K=$ 50. It is also worth noticing that the optimum points for the Hybrid, Canopy, and K-Means methods are all rather similar, while the recall value of the Hybrid approach is greater. This comparison also suggests that, of the three, the proposed algorithm may have the most difficulty recognising clients with abnormal electricity usage patterns. When compared to the two other detection approaches, the proposed method has a higher area under the ROC curve. These findings lend credence to the Hybrid technique as a viable strategy for detecting instances of excessive energy use. The AUC values for each ROC curve are shown in Table 5.

Table 5
Small dataset, the AUC of Canopy, Hybrid and K-means with different $k$ values tested

K	AUC
	Canopy	K-means	Hybrid
30	1.00	0.83366	1.00
40	0.88716	0.86612	0.98552
50	0.99229	0.98832	0.99229

Table 6

Comparison between proposed study and literature methods

K	AUC
	Current study			Benchmark
	Canopy	K-means	Hybrid	“LMR”	“LOF”	“GKLOF”
20	–	–	–	1	0.7182	0.7108
30	1.00	0.83366	1.00	1	0.8078	0.7691
40	0.88716	0.86612	0.98552	1	0.9862	0.9998
50	0.99229	0.98832	0.99229	1	0.9986	0.9999
60	0.82404	0.88338	0.9171	0.8251	0.6665	0.5818
80	–	–	–	0.8405	0.6963	0.6376
90	0.84069	0.89501	0.93844	–	–	–
100	–	–	–	0.8126	0.7395	0.7584
120	0.84069	0.87575	0.99647	0.8463	0.7804	0.8297

4.3.5 Comparison of current research with benchmark research

Based on the results of our literature review, we compared our methods against several popular algorithms. In this study, we tested how well a detection method based on a combination of Canopy and K-Means could perform, and compared it against Canopy and K-Means separately. Three algorithms were used in the studies we reviewed: Local Matrix Reconstruction (LMR), Local Outlier Factor (LOF), and a Gaussian Kernel Function Improved LOF Algorithm (GKLOF). Our results are summarized in Table 6, which compares the Canopy, K-Means, and Hybrid algorithms to the LMR, LOF, and GKLOF algorithms. The AUC values achieved for different values of $k$ on both small and large datasets are listed in Table 6. In this work, we focus on contrasting the performance of the Canopy and K-Means algorithms with that of the hybrid detection technique, Canopy $+$ K-Means. We also took into account findings from studies that made use of the GKLOF, LMR, and LOF algorithms already present in the literature [49].

1. Small data performance

On small datasets, the hybrid algorithm outperforms the Canopy and K-Means algorithms in terms of detection performance. This shows that the hybrid algorithm has a stronger ability to discriminate between normal and aberrant patterns of electricity usage in small datasets. The LMR-based detection algorithm from the literature benchmark, on the other hand, consistently achieves a score of 1.00 for every value of $k$ in the small dataset. This shows that anomalous consumption patterns are consistently and precisely detected by the LMR algorithm.

2. Large data performance

Even when used on large datasets, the hybrid method continues to outperform the Canopy and K-Means methods. This suggests that even when dealing with a larger population of clients with varied power usage patterns, the hybrid algorithm’s performance is still reliable. In addition, a review of the literature shows that the performance of the GKLOF, LOF, and LMR algorithms varies when handling large datasets. Performance for the LOF approach falls more precipitously than for the LMR method, while the GKLOF algorithm performs fairly consistently. Our research shows that for both small and large datasets, the hybrid-based detection approach outperforms the Canopy and K-Means techniques. The usefulness of the LMR-based detection approach, especially for small datasets, is highlighted by these results, which are consistent with earlier research. They also show the shortcomings of the LOF and GKLOF algorithms, particularly when dealing with large datasets and small $k$ values. On small datasets, the hybrid approach obtains AUC values at least as high as those of the Canopy and K-Means algorithms for every value of $k$. The hybrid approach also continues to perform well on large datasets, while the LMR and LOF algorithms show a decline in performance and the GKLOF algorithm’s performance is largely consistent. Our results indicate that the hybrid-based detection approach generally outperforms the LMR, LOF, and GKLOF algorithms used in the literature, especially in circumstances involving large datasets (Table 6).

5. Conclusion

In order to find unusual patterns in power consumption, this paper offers a hybrid technique that combines the Canopy and K-means algorithms. The results of this investigation have provided several important observations. First, a powerful measure of the level of abnormality in each sample of power consumption has been created and put into practice. It does this by using the Hybrid algorithm at the neighbours level. Second, the study focuses on characterizing daily load characteristics rather than high-dimensional daily loads, making it simpler to compare load data gathered at various sampling rates. Third, the suggested approach outperforms the other two detection algorithms with regard to detection precision and parameter sensitivity. It might considerably cut down on the time and materials needed to conduct inspections while increasing overall accuracy. The necessary algorithms and threshold settings can also be used to determine the threshold amount for abnormal line losses for a power system. Enhancing our comprehension of threshold creation in various circumstances should be the goal of subsequent studies in order to advance the area.

Footnotes

Acknowledgments

The researchers would like to thank the Research lab of the FTSM-UKM, in Malaysia and the University of Fallujah in Iraq for their assistance.

Conflict of interest

The authors declare no conflict of interest.

Funding

Universiti Kebangsaan Malaysia (UKM) financed this research under UKM Grant Code FRGS/1/2021/ICT07/UKM/02/1.

References

1. Hasan M.K., Ahmed M.M., Hashim A.H.A., Razzaque, Islam and Pandey, A novel artificial intelligence based timing synchronization scheme for smart grid applications, Wirel. Pers. Commun 114(2) (Sep. 2020), 1067–1084. doi: 10.1007/s11277-020-07408-w.

2. AL-Jumaili A.H.A., Al Mashhadany Y.I., Sulaiman and Alyasseri Z.A.A., A conceptual and systematics for intelligent power management system-based cloud computing: Prospects, and challenges, Appl. Sci 11(21) (Oct. 2021), 9820. doi: 10.3390/APP11219820.

3. Rao S.N.V.B. et al., Day-ahead load demand forecasting in urban community cluster microgrids using machine learning methods, Energies 15(17) (Aug. 2022), 6124. doi: 10.3390/en15176124.

4. AL-Jumaili A.H.A., Muniyandi R.C., Hasan M.K., Singh M.J., Paw J.K.S. and Amir, Advancements in intelligent cloud computing for power optimization and battery management in hybrid renewable energy systems: A comprehensive review, Energy Reports 10 (2023), 2206–2227. doi: 10.1016/j.egyr.2023.09.029.

5. Guo, Zhang and Sun, An efficient state estimation algorithm considering zero injection constraints, IEEE Trans. Power Syst 28(3) (2013), 2651–2659. doi: 10.1109/TPWRS.2012.2232316.

6. Sabir, Asif Zahoor Raja, Guirao J.L.G. and Shoaib, A novel design of fractional Meyer wavelet neural networks with application to the nonlinear singular fractional Lane-Emden systems, Alexandria Eng. J 60(2) (2021), 2641–2659. doi: 10.1016/j.aej.2021.01.004.

7. Kim and Bang, Introduction to Kalman Filter and Its Applications, vol. 1. IntechOpen London, UK, 2019.

8. Hogg and Leschziner M.A., Computation of highly swirling confined flow with a reynolds stress turbulence model, AIAA J 27(1) (1989), 57–63. doi: 10.2514/3.10094.

9. Amir, Zaheeruddin and Haque, Intelligent based hybrid renewable energy resources forecasting and real time power demand management system for resilient energy systems, Sci. Prog 105(4) (Oct. 2022), 003685042211321. doi: 10.1177/00368504221132144.

10. Choi Y.D., Iacovides and Launder B.E., Numerical computation of turbulent flow in a square-sectioned 180 deg bend, J. Fluids Eng. Trans. ASME 111(1) (1989), 59–68. doi: 10.1115/1.3243600.

11. Kober, Schiffer H.W., Densing and Panos, Global energy perspectives to 2060 – WEC’s World Energy Scenarios 2019, Energy Strateg. Rev 31 (2020). doi: 10.1016/j.esr.2020.100523.

12. Hurst, Montañez C.A.C. and Shone, Time-pattern profiling from smart meter data to detect outliers in energy consumption, IoT 1(1) (2020), 92–108. doi: 10.3390/iot1010006.

13. Zanetti, Jamhour, Pellenz, Penna, Zambenedetti and Chueiri, A tunable fraud detection system for advanced metering infrastructure using short-lived patterns, IEEE Trans. Smart Grid 10(1) (2019), 830–840. doi: 10.1109/TSG.2017.2753738.

14. Yip S.C., Tan C.K., Tan W.N., Gan M.T. and Bakar A.H.A., Energy theft and defective meters detection in AMI using linear regression, in: Conference Proceedings – 2017 17th IEEE International Conference on Environment and Electrical Engineering and 2017 1st IEEE Industrial and Commercial Power Systems Europe, EEEIC/I and CPS Europe 2017, 2017. p. 6. doi: 10.1109/EEEIC.2017.7977752.

15. Singh S.K., Bose and Joshi, Entropy-based electricity theft detection in AMI network, IET Cyber-Physical Syst. Theory Appl 3(2) (2018), 99–105. doi: 10.1049/iet-cps.2017.0063.

16. Alobaidy H.A.H., Mandeep J.S., Nordin, Abdullah N.F., Wei C.G. and Soon M.L.S., Real-World Evaluation of Power Consumption and Performance of NB-IoT in Malaysia, IEEE Internet Things J 4662(c) (2021), 1–1. doi: 10.1109/jiot.2021.3131160.

17. Saifi I.A., Haque, Amir and Bharath Kurukuru V.S., Intelligent Islanding Classification with MLPNN for Hybrid Distributed Energy Generations in Microgrid System, in: 2023 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE), Jan. 2023. pp. 982–987. doi: 10.1109/IITCEE57236.2023.10091089.

18. Al-Jarrah O.Y., Al-Hammadi, Yoo P.D. and Muhaidat, Multi-layered clustering for power consumption profiling in smart grids, IEEE Access 5 (2017), 18459–18468. doi: 10.1109/ACCESS.2017.2712258.

19. et al., High-precision dynamic modeling of two-staged photovoltaic power station clusters, IEEE Trans. Power Syst 34(6) (2019), 4393–4407. doi: 10.1109/TPWRS.2019.2915283.

20. Wang, Yang, Xie, Yang and Chen, Real-time subsynchronous control interaction monitoring using improved intrinsic time-scale decomposition, J. Mod. Power Syst. Clean Energy 11(3) (May 2023), 816–826. doi: 10.35833/MPCE.2021.000464.

21. Min et al., Toward interpretable anomaly detection for autonomous vehicles with denoising variational transformer, Eng. Appl. Artif. Intell., Jan. 2024, 107601. doi: 10.1016/j.engappai.2023.107601.

22. Zhao, Fang, Min, Wang and Teixeira, Potential sources of sensor data anomalies for autonomous vehicles: An overview from road vehicle safety perspective, Expert Syst. Appl. 236 (Feb. 2024). doi: 10.1016/j.eswa.2023.121358.

23. Cao, Zhang, Wang, Zhao and Zhang, A memetic algorithm based on two_Arch2 for multi-depot heterogeneous-vehicle capacitated arc routing problem, Swarm Evol. Comput. 63 (2021), 100864.

24. Singh, Amir, Ahmad and Refaat S.S., Enhancement of frequency control for stand-alone multi-microgrids, IEEE Access 9 (2021), 79128–79142. doi: 10.1109/ACCESS.2021.3083960.

25. Singh, Amir and Arya, Optimal dynamic frequency regulation of renewable energy based hybrid power system utilizing a novel TDF-TIDF controller, Energy Sources, Part A Recover. Util. Environ. Eff 44(4) (Dec. 2022), 10733–10754. doi: 10.1080/15567036.2022.2158251.

26. Hasan M.K. et al., Dynamic load modeling for bulk load-using synchrophasors with wide area measurement system for smart grid real-time load monitoring and optimization, Sustain. Energy Technol. Assessments 57 (2023), 103190. doi: 10.1016/j.seta.2023.103190.

27. Wang, Chen, Kang and Xia, Clustering of electricity consumption behavior dynamics toward big data applications, IEEE Trans. Smart Grid 7(5) (2016), 2437–2447. doi: 10.1109/TSG.2016.2548565.

28. Aghabozorgi, Seyed Shirkhorshidi and Ying Wah, Time-series clustering – A decade review, Inf. Syst 53 (2015), 16–38. doi: 10.1016/j.is.2015.04.007.

29. Hassan M.H. and Muniyandi R.C., An improved hybrid technique for energy and delay routing in mobile ad-hoc networks, Int. J. Appl. Eng. Res 12(1) (2017), 134–139.

30. Gong, Wang and You, Distributed evidential clustering toward time series with big data issue, Expert Syst. Appl 191(August 2021) (2022), 116279. doi: 10.1016/j.eswa.2021.116279.

31. Shah, Haque, Amir and Kumar, Investigation of Renewable Energy Integration Challenges and Condition Monitoring Using Optimized Tree in Three Phase Grid System, in: 2023 7th International Conference on Computing Methodologies and Communication (ICCMC), Feb. 2023. pp. 1582–1588. doi: 10.1109/ICCMC56507.2023.10083636.

32. Ayoub, Haque, Amir and Kurukuru V.S.B., Intelligent Islanding Classification with Optimal k-Nearest Neighbors Technique for Single Phase Grid Integrated PV System, in: 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Oct. 2022. pp. 1–6. doi: 10.1109/GCAT55367.2022.9972088.

33. Elkawkagy and Elbeh, High performance hadoop distributed file system, Int. J. Networked Distrib. Comput 8(3) (2020), 119–123.

34. AL-Jumaili A.H.A., Muniyandi R.C., Hasan M.K., Paw J.K.S. and Singh M.J., Big data analytics using cloud computing based frameworks for power management systems: Status, constraints, and future recommendations, Sensors 23(6) (2023), 2952. doi: 10.3390/s23062952.

35. Al-Sharqi, Ahmad A.G. and Al-Quran, Interval-valued neutrosophic soft expert set from real space to complex space, C. Model. Eng. Sci 132(1) (2022), 267–293.

36. Oussous, Benjelloun F.-Z., Ait Lahcen and Belfkih, Big Data technologies: A survey, J. King Saud Univ. – Comput. Inf. Sci 30(4) (2018), 431–448. doi: 10.1016/j.jksuci.2017.06.001.

37. Anil et al., Apache mahout: Machine learning on distributed dataflow systems, J. Mach. Learn. Res 21(1) (2020), 4999–5004.

38. Pop, Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions, arXiv Prepr. arXiv1603. 08767, 2016, [Online]. Available: http://arxiv.org/abs/1603.08767.

39. Palaniswami, Rao A.S., Kumar, Rathore and Rajasegarar, The role of visual assessment of clusters for big data analysis: From real-world internet of things, IEEE Syst. Man, Cybern. Mag 6(4) (2020), 45–53. doi: 10.1109/msmc.2019.2961160.

40. Xia, Ning and He, Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform, J. Grid Comput 18(2) (2020), 263–273. doi: 10.1007/s10723-019-09504-z.

41. Yuan and Yang, Research on K-Value Selection Method of K-Means Clustering Algorithm, J 2(2) (2019), 226–235. doi: 10.3390/j2020016.

42. Tarekegn A.N., Michalak and Giacobini, Cross-validation approach to evaluate clustering algorithms: An experimental study using multi-label datasets, SN Comput. Sci 1(5) (2020), 1–9. doi: 10.1007/s42979-020-00283-z.

43. Singh, Dahiya, Grover, Adlakha and Amir, An effective cascade control strategy for frequency regulation of renewable energy based hybrid power system with energy storage system, J. Energy Storage 68 (Sep. 2023), 107804. doi: 10.1016/j.est.2023.107804.

44. Ansari M.Y., Ahmad, Khan S.S., Bhushan and Mainuddin, Spatiotemporal clustering: A review, Artif. Intell. Rev 53(4) (2020), 2381–2423. doi: 10.1007/s10462-019-09736-1.

45. Taamneh, Qawasmeh and Aljammal A.H., Parallel and fault-tolerant k-means clustering based on the actor model, Multiagent Grid Syst 16(4) (2020), 379–396. doi: 10.3233/MGS-200336.

46. Capó, Pérez and Lozano J.A., An efficient K-means clustering algorithm for tall data, Data Min. Knowl. Discov 34(3) (2020), 776–811. doi: 10.1007/s10618-020-00678-9.

47. Maroosi, Muniyandi R.C., Sundararajan and Zin A.M., Parallel and distributed computing models on a graphics processing unit to accelerate simulation of membrane systems, Simul. Model. Pract. Theory 47 (2014), 60–78. doi: 10.1016/j.simpat.2014.05.005.

48. Maroosi and Muniyandi R.C., Accelerated execution of P systems with active membranes to solve the N-queens problem, Theor. Comput. Sci 551(C) (2014), 39–54. doi: 10.1016/j.tcs.2014.05.004.

49. Feng, Huang, Tang W.H. and Shahidehpour, Data mining for abnormal power consumption pattern detection based on local matrix reconstruction, Int. J. Electr. Power Energy Syst 123(March) (2020), 106315. doi: 10.1016/j.ijepes.2020.106315.
