Effective management of class imbalance problem in climate data analysis using a hybrid of deep learning and data level sampling

Abstract

Climate change and its consequences for human life have emerged as the world’s most pressing challenge. Due to the complexity, veracity, and velocity of climate data, a single traditional machine learning model is not sufficient for effective and timely analysis. This paper proposes a hybrid model with which climate data can be effectively analyzed and climate models can be developed. A deep learning AutoEncoder (AE) is used for feature extraction and for the removal of redundant and noisy data. The Synthetic Minority Oversampling Technique (SMOTE) generates samples in the minority class to mitigate the imbalance in the sample distribution. An Extreme Learning Machine (ELM) is then used to classify the extracted features. The proposed method exploits big data strategies and the results interpretation process to extract accurate insight from climate data. ELM handles the class imbalance problem to improve the performance of the Early Warning System (EWS) model and fine-tune it. The hybrid method drastically reduces the computation cost and improves the accuracy to 93%, 86%, 95%, and 98% on four different datasets against other machine learning models. Compared with other state-of-the-art deep learning methods, the AE_SMOTE_ELM model shows an accuracy and an efficiency of 90.4% and 91.76%, respectively, for two climate datasets.

Keywords

AutoEncoder SMOTE ELM climate data class imbalance

1 Introduction

Climate change is a critical environmental event that significantly impacts countries’ social and economic standing. Climate data is gathered from various sources, including weather stations, weather balloons, radars, ships and buoys, and satellites. Every country has tens of thousands of such data sources that are made available frequently.

The magnitude and frequency of severe weather events associated with climate change are rising. Climate data is generated and processed at an increasingly higher pace, making the response of traditional systems ineffective.

In this scenario, automated and faster update in the machines’ decision making is necessary to override the public and private interventions. Deep learning has emerged as a potent tool for data analysis in recent years. The proposed work aims to demonstrate how deep machine learning can be used in climate change prediction to reduce the impact on society through effective engineering. A thorough analysis is carried out by utilizing existing data to construct an extreme event forecasting model.

Many methodologies have been proposed in the past for an accurate classifier. The enhancement of the generalization capacity of the integrated deep learning strategies is adopted to manage the deluge of climate data.

In climate data analysis, machine learning and deep learning concepts such as AE, Principal Component Analysis, Extreme Learning Machine (ELM), and Support Vector machine are used to understand climate data better. The prediction accuracy is improved by adopting the methodologies of deep learning and machine learning. The hybrid model is compared with various other models, including state-of-the-art deep learning models. AE deep learning is a non-linear technique in an unsupervised machine learning algorithm that aids in dimensionality reduction. The AE consists of two components viz., an Encoder that learns the data representation, important features of the data, and a Decoder that reconstructs the data based on the idea of how it is structured. The performance of the final classification shows improvement in finding out the anomalies due to climate change. The improved performance helps to build an efficient Early Warning System (EWS).

Proceeding from the background study of the AE, SMOTE, and ELM, the contributions of the paper are as follows:

Preprocessing of data for noise removal, feature selection, and redundancy removal by AE, evaluated by reconstruction error

Generation of minority class samples by SMOTE to mitigate the imbalance in the sample distribution and to overcome overfitting

Binary classification of the oversampled data by ELM to overcome the class imbalance problem

Comparison with other hybrid methods: SMOTE_ELM, Weighted Extreme Learning Machine (W_ELM), Fuzzy Support Vector Machines for Class Imbalance Learning (FSVM_CIL), and Principal Component Analysis optimized Generative Adversarial Networks Extreme Learning Machine (PCA_GAN_ELM)

Improvement in accuracy and efficiency, as shown by the evaluation metrics

Comparison with state-of-the-art deep learning methods to affirm the efficiency of the proposed method

In this paper, an integrated model of ELM and AE is proposed, which combines the best attributes of an AE and rapid training of ELM. AE provides high-level features from input data after removing the redundant and irrelevant features from the input dataset. The SMOTE method generates samples in minority class to mitigate the imbalance in the sample distribution, to control overfitting. The abstracted features from the balanced dataset are then fed to ELM for the classification of data. The accuracy of prediction and sensitivity is improved, and uncertainty is reduced in the outcome of the classifier. The performance is evaluated by the following metrics viz., Root Mean Square Error (RMSE), Sensitivity, F1 measure, Precision, recall, and Receiver Operating Characteristics (ROC). This integrated model is compared with ELM, Online Sequential Extreme Learning Machine (OSELM), and Weighted Extreme Learning Machine (WELM) to improve the model design.

The following parts comprise the paper:

The problem description and background study are detailed in the sections Introduction, Related work, and Proposed Work

The hybrid model of AE, SMOTE, and ELM is discussed in the section Methods

The datasets used, the evaluation metrics, and the analysis of the outcome are presented in the Performance Analysis section

The comparison with other machine learning and deep learning models is presented in the Comparison with deep learning methods and conclusion sections

The objective of the proposed method is to investigate the performance of climate data forecasting with data preprocessing methods. AE is used to preprocess the time series data, remove noise, identify missing data, and select features. Preprocessing is done on a vast dataset to reduce the dataset size without losing information, with improved accuracy in forecasting.

2 Related work

This section reviews the work carried out by the researchers in identifying class imbalance problems using machine learning techniques and deep learning methods.

In climate change studies, machine learning techniques provide computerized, intelligent predictions for weather forecasting. Imbalanced class distributions also occur in real-world climate datasets, which causes the loss of valuable information about weather prediction from the abnormal class. As a result, the forecast for the abnormal class shows drawbacks. The training set is high-dimensional, which lowers the classification accuracy for the abnormal and minority classes and leads to incorrect predictions. Therefore, building an appropriate classification model for imbalanced climate datasets is urgently needed. Many studies have suggested methods to manage the classification of class-imbalanced data.

Learning from the imbalance that exists in the data leads to the problem of uneven data distribution. Because of the presence of an imbalance, the importance of the minority class is recognized, which is beneficial in preventing issues of handling rare cases in monitoring climate systems [22]. The classical machine learning algorithms do not show stable classification accuracy for imbalanced noisy datasets [31]. The deep learning AE, in conjunction with other machine learning algorithms, is used to reduce dimensions, provide the unavoidable result of an event flow prediction, and improve the retrieval of high-quality data in the temporarily under-sampled data domain [2].

To avoid system collapse or other abrupt shifts, early warning signals (EWS) indicating tipping points are critical. Sumit, T.M. et al. suggest that DL algorithms can not only improve the sensitivity and specificity of EWS for authority transitions but can also be applied to a wide range of systems. By combining dynamical insights with deep learning models, EWS of tipping points with improved sensitivity and specificity were obtained from the existing systems. The prediction of the type of tipping point offers particular qualitative information about the new state that exists beyond the tipping point [5].

In India, extreme weather poses a variety of threats to its economy and society, and future climate change is predicted to exacerbate these threats. The data outcomes from the physical and socioeconomic activities are utilized to spatially-disaggregated estimates on temperature sensitivities of the Indian Climate Early Warning System design. Climate impacts are not only future events but also current requirements with direct policy implications, necessitating the creation of an Indian Climate Early Warning System [6].

Statistical representation and numerical modeling of weather forecasting, together with recent advances in deep neural networks and other machine learning technologies, show improvement in computing and prediction efficiency. The research segregates the time scale of the weather report into short-, medium-, and long-range work. A gap exists in the availability of high-quality weather data with timely feedback, which could not be handled by conventional machine learning techniques. The introduction of deep learning models is expected to show improved prediction. There is evidence of the adoption of machine learning techniques by the weather and climate community in parameter estimation and the extraction of dynamic modes [7].

According to the IPCC’s AR6 report, results from the latest climate models show a larger climate sensitivity parameter, which is the increase in temperature due to a doubling of atmospheric CO₂. Comparison of the latest models with the previously existing approach indicates the probability of a global tipping point. This gap can be addressed by introducing newer deep learning technologies and climate prediction models built with richer indigenous data and feedback in the Earth system. The efficiency of the models must be improved so that they capture early, sudden climate changes and hot greenhouse climate states with sufficient prediction confidence [18].

According to Climate Risk Management (CRM), the highlighted factors are Accountability and Efficiency [27]. It emphasizes the location of CRM in ministry, emergency management departments with planning and fiscal responsibilities to provide political authority and policy reformation. If CRM is implemented at the local bodies, in partnership with households, communities, and other Non-Governmental Organizations (NGOs), benefits such as sustainability and cost-effectiveness can be attained, and efficiency can be improved.

The predictive analysis of datasets with imbalanced feature distributions is associated with real-world applications such as fraud detection, misdiagnosis of diseases, and misprediction in climate forecasts. The model proposed by Branco et al. concentrated on prediction models based on classification tasks and regression tasks [4].

Malvoni, M. et al. proposed the wavelet decomposition and principal component analysis to preprocess the meteorological data. Further, they applied the Group Least Square Support Vector Machine (GLSSVM) method, Least Square Support Vector Machine (LS_SVM) method, and Group Method of Data Handling (GMDH) on weather data to implement Photovoltaic Daylight forecast system [25].

Sivanesan et al. adopted Artificial Neural Network (ANN) and Fuzzy system on the solar forecast to improve the solar power generation by Clear sky radiation model. A numerical weather prediction model is built with fuzzy logic preprocessing, which aids in forecasting accuracy [30].

Class imbalance learning on the sparse data structure is handled using a binary classification setting. The Gmean output can be regenerated to the convex loss function by the convex relaxation technique. An L1 regularized proximal learning framework and a stochastic proximal gradient descent algorithm are used [26].

The authors suggested the excess gap between the conventional and current technologies in handling the imbalance in the dataset. The survey discussed intelligence sampling and hybrid sampling in the learning module in retaining difficult-to-learn samples and eliminating easy-to-learn samples [33]. The class imbalance challenges are addressed by deep learning methods instead of the traditional machine learning methods. The adaptability of deep learning improves data sampling and the incorporation of cost-sensitive learning to produce better results. The authors have discussed data-driven, algorithm-driven, and data-and-algorithm-driven methods in deep learning on class imbalance data [21].

Hassani et al. emphasized performing big data analytics in climate data by observing the climate change data, and achieving better prediction. The researchers have opened up an extensive review on climate change to get insight into climate knowledge. In smart farming, energy conservation, weather forecasting, and addressing natural disaster management, the implementation of research work has helped [12].

Weather forecasting is essential in air traffic services, farming, flood, energy, and environmental sustainability. Saba and Rehman, and Al Ghamdi developed a hybrid model comprising Multilayer Perceptron and Radial Basis Function neural networks to enhance the accuracy of the climate prediction model [29]. The weather forecasting model experimented with ANN combined with machine learning to predict the maximum temperature and to optimize the performance of the model on micro and macro environmental factors [1, 3]. Wonji Lee et al. referred to weighted Support Vector Machine (SVM) to classify imbalanced data. This study attempted to address overlaps, disjunctions, and data shift along with the class imbalance problem. The paper also explains adopting various deep learning methods over machine learning techniques and providing insight into the class imbalance problem [23].

3 Proposed Work

Building on existing hybrid methods, a pipeline of AE, SMOTE, and ELM is proposed to improve the efficiency in handling the class imbalance of climate data. The framework of the proposed work is shown in the flow diagram in Fig. 1. The imbalanced data from various climate and weather data sources are collected and pre-processed with the help of AE. The quality of the generated samples is evaluated with RMSE values. After binary classification, the generated samples of climate data are evaluated by the performance metrics, viz., accuracy, Gmean, and F1 score, and compared with other hybrid methods to demonstrate the superior performance of the AE_SMOTE_ELM method in overcoming the imbalanced nature of the dataset.

Fig. 1

AE_SMOTE_ELM Model Framework.

The AE_SMOTE_ELM classification model is proposed to address the limited and imbalanced nature of climate data. The initial part of the model, AE, extracts features and reduces dimensions. It is followed by SMOTE, which generates synthetic minority samples to balance the class distribution. The numerical relationship between the imbalance ratio and the hyperparameters of the hidden nodes of the ELM is defined based on the number of features and samples. Figure 1 shows the whole work process model in sequence.

4 Methods

4.1 AutoEncoder (AE)

AE is a neural network that learns a function mapping the input feature x to the output feature $\hat{x}$. The encoder activation function f(x) is usually the Rectified Linear Unit (ReLU), a non-linear activation function used in deep neural networks. With x as the input value, it is given by $f(x) = \max(0, x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases}$ (1)

ReLU activates a neuron only for positive input values, and its constant gradient for positive inputs accelerates training. $\frac{d}{dx} ReLU (x) = 1 \; \forall x > 0$ (2)
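Equations (1) and (2) can be illustrated with a short NumPy sketch. The function names `relu` and `relu_grad` are hypothetical, and the gradient at x = 0, which Equation (2) leaves undefined, is set to 0 here by assumption.

```python
import numpy as np

def relu(x):
    """Equation (1): element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Equation (2): gradient is 1 for x > 0.
    Assumption: the gradient is taken as 0 for x <= 0."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs are clipped to zero
print(relu_grad(x))  # only positive inputs pass a gradient
```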

Ibrahim Gad et al. suggested the Stochastic Gradient Descent (SGD) optimizer for a deep learning model that imputes missing data. The SGD optimizer is evaluated with root mean square error (RMSE), mean absolute error, and mean square error (MSE) on weather data [10].

The robust optimizer of Adadelta performs the compilation of AE. It evaluates the loss of MSE between the encoder and the decoder with the SGD method to speed up the learning process.

The data subject to dimensionality reduction is tabular data arranged in columns. Each column is normalized by $x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$ (3)
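A minimal sketch of the column-wise normalization in Equation (3) is given below. The helper name `min_max_normalize`, the sample readings, and the handling of constant columns (mapped to 0 to avoid division by zero) are assumptions made for illustration.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every column of X to [0, 1] following Equation (3)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    # Assumption: a constant column (span == 0) is mapped to 0.
    safe_span = np.where(span == 0, 1.0, span)
    return (X - col_min) / safe_span

# Hypothetical readings: temperature, wind speed, and a constant flag column.
X = np.array([[20.0, 5.0, 1.0],
              [30.0, 15.0, 1.0],
              [25.0, 10.0, 1.0]])
X_norm = min_max_normalize(X)
print(X_norm)
```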

The activation function at the output layer is the sigmoid function given by, $y = \frac{1}{1 + e^{- x}}$ (4)

The class imbalance problem in machine learning arises when the total number of samples in one class is far less than in the other class. The class imbalance problem is explained in [19]. Because of the unbounded size and imbalanced nature of data, classification of data becomes difficult. An anomaly is visible from the AE’s reconstruction error rate. If there is an anomaly in the dataset, the value of the reconstruction error rate will be at an elevated level. The mean square distance error is used to compute the reconstruction error.

The loss function (L) between the input and output is given by, $L = \frac{1}{n} \sum_{i = 0}^{n - 1} {({\hat{x}}_{i} - x_{i})}^{2}$ (5)
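The use of Equation (5) for spotting anomalies can be sketched as follows. This is an illustration only: the error is averaged over the features of each sample, and the function names, the sample values, and the threshold of 0.1 are all hypothetical.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Mean squared error of Equation (5), computed per sample over its features."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x_hat - x) ** 2, axis=1)

def flag_anomalies(x, x_hat, threshold):
    """Mark samples whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold

# Hypothetical inputs and reconstructions for two samples.
x = np.array([[0.2, 0.4], [0.9, 0.1]])
x_hat = np.array([[0.2, 0.5], [0.1, 0.9]])
errors = reconstruction_error(x, x_hat)
flags = flag_anomalies(x, x_hat, threshold=0.1)  # threshold chosen arbitrarily
print(errors, flags)
```

The second sample is reconstructed poorly, so its error is elevated and it is flagged, which mirrors the behaviour described for anomalous data.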

The model can be trained with the climate datasets, applying few feature engineering techniques on the feature set, and dimensionality can be reduced.

Algorithm 1 explains the steps involved in the AE method to preprocess the imbalanced dataset.

Algorithm 1: Feature Selection by Filtered_AutoEncoder

Input:

X - input feature, λ - regularization weight, N - number of hidden neurons, ɛ - the threshold to exit the iteration

Output:

AE_Model

Method:

Initialize Loss(L) to zero and Model to AE_Model

1. for i = 1 to N

2. Compute activation function a_f

3. Add Layer of n_i neurons

4. Compute L_i

5. end for

6. for i = 1 to N

7. L_i = Model.Train(L_i,SGD)

8. if (|L_i − L_{i−1}| < ɛ)

9. break;

10. end if

11. end for

12. return Model

In Algorithm 1, X is the input feature, λ the regularization weight, N the number of hidden neurons, a_f the activation function, and ɛ the threshold used to check the iteration in which the loss converges. For the i^th layer, n_i is the number of neurons, where i takes values from 1 to N. The loss function is computed initially. The neural network model is built with N and a_f, the model is trained by the stochastic gradient descent method, and the error is calculated in each iteration. The error is compared with the loss from the previous iteration, and the iteration continues until the difference falls below the threshold ɛ. The SGD optimizer in AE uses a learning rate of 0.01, and the momentum is varied from 0 to 1.0 in steps of 0.2. This accelerates the gradient descent in the appropriate direction. The AE produces the best accuracy of 68.14% when it reaches 100 and momentum at 0.2. The combination of a 20% dropout rate and a weight constraint of 4 limits overfitting and improves the model’s ability to generalize. Five-fold cross-validation of the AE is used to achieve stability in the results.
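The training loop of Algorithm 1 can be sketched in NumPy as below. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: it uses a single hidden layer, full-batch gradient descent with momentum instead of mini-batch SGD, and omits dropout, the weight constraint, and the regularization weight λ. The names `train_autoencoder` and `encode`, the random demo data, and the values of `eps` and `max_epochs` are hypothetical; the learning rate 0.01 and momentum 0.2 follow the text.

```python
import numpy as np

def train_autoencoder(X, n_hidden, lr=0.01, momentum=0.2, eps=1e-7,
                      max_epochs=500, seed=0):
    """ReLU encoder, sigmoid decoder, MSE loss; stops when |L_i - L_{i-1}| < eps."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    params = {
        "W1": rng.normal(0.0, 0.1, (d, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, d)), "b2": np.zeros(d),
    }
    velocity = {k: np.zeros_like(v) for k, v in params.items()}
    losses = []
    for _ in range(max_epochs):
        # Forward pass: encode with ReLU, decode with sigmoid.
        z1 = X @ params["W1"] + params["b1"]
        a1 = np.maximum(0.0, z1)
        x_hat = 1.0 / (1.0 + np.exp(-(a1 @ params["W2"] + params["b2"])))
        losses.append(np.mean((x_hat - X) ** 2))
        # Convergence check from steps 8-10 of Algorithm 1.
        if len(losses) > 1 and abs(losses[-1] - losses[-2]) < eps:
            break
        # Backward pass for the MSE loss.
        d_z2 = 2.0 * (x_hat - X) / X.size * x_hat * (1.0 - x_hat)
        d_z1 = (d_z2 @ params["W2"].T) * (z1 > 0)
        grads = {"W1": X.T @ d_z1, "b1": d_z1.sum(axis=0),
                 "W2": a1.T @ d_z2, "b2": d_z2.sum(axis=0)}
        # Gradient descent update with momentum.
        for k in params:
            velocity[k] = momentum * velocity[k] - lr * grads[k]
            params[k] += velocity[k]
    return params, losses

def encode(X, params):
    """Reduced representation taken from the bottleneck layer."""
    return np.maximum(0.0, X @ params["W1"] + params["b1"])

# Hypothetical normalized data: 200 samples with 8 features in [0, 1].
X_demo = np.random.default_rng(1).random((200, 8)) ** 2
params, losses = train_autoencoder(X_demo, n_hidden=4)
print(len(losses), losses[0], losses[-1])
print(encode(X_demo, params).shape)
```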

4.2 SMOTE

SMOTE is the oversampling of minority classes. The method is developed and tested with various settings, viz., different feature sets, different percentages, and the different number of nearest neighbors. The advantage of the k-nn setting is to avoid overfitting problems. Algorithm 2 explains the sample generation of minority class by random generation, and the samples are confirmed by the k nearest neighbor approach.

Algorithm 2: SMOTE (AE_Model, N, k)

Input:

AE_Model output - number of samples of minority class (T), N in % of SMOTE quantity, k - number of nearest neighbor, N_Attr – number of features, Minority array –Minority samples before randomization

Output:

N% of T – generated samples of minority class. Minority synthetic array – Minority synthetic samples generated

Method:

Initialize index, n_index

1. if N < 100%

2. Randomize T minority class samples

3. T ← N % of T

4. N ← 100

5. end if

6. N ← (int)(N / 100) (the SMOTE amount is assumed to be an integral multiple of 100)

7. for i = 1 to T do

8. Compute k nearest neighbors

9. Store indices in n_index[]

10. Aggregate(N, i, n_index)

11. end for

12. return Minority_Synthetic array

13. function Aggregate(N,i,n_index)

14. while N ≠ 0

15. krand = randomize(1:k)

16. for attr = 1 to N_Attr do

17. diff = Minority[n_index[krand]][attr] - Minority[i][attr]

18. gap = randomize(0 : 1)

19. Minority_Synthetic[index][attr] = Minority[i][attr] + gap * diff

20. end for

21. index + = 1

22. decrement N by 1

23. end while

24. Return

25. end-function
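The steps of Algorithm 2 can be sketched in NumPy as follows. This is a minimal illustration assuming Euclidean distance and at least two distinct minority samples; the function name `smote`, its parameter names, and the sample points are hypothetical. Each synthetic sample is placed on the line segment between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np

def smote(minority, n_percent=100, k=5, seed=0):
    """Generate synthetic minority samples along the lines of Algorithm 2."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    t = len(minority)
    k = min(k, t - 1)  # assumption: cap k when there are few minority samples

    if n_percent < 100:
        # Steps 1-5: keep a random n_percent% of the samples, one synthetic each.
        n_keep = max(1, int(t * n_percent / 100))
        seeds = minority[rng.permutation(t)[:n_keep]]
        per_sample = 1
    else:
        seeds = minority
        per_sample = n_percent // 100  # step 6

    synthetic = []
    for x in seeds:
        # Steps 8-9: indices of the k nearest minority neighbors of x.
        dist = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # position 0 is x itself (no duplicate rows assumed)
        for _ in range(per_sample):
            nn = minority[rng.choice(neighbours)]  # step 15: pick a random neighbor
            gap = rng.random()                     # step 18: uniform in [0, 1)
            synthetic.append(x + gap * (nn - x))   # steps 17 and 19: interpolate
    return np.array(synthetic)

# Hypothetical minority samples with two features.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
new_samples = smote(minority, n_percent=200, k=3)
print(new_samples.shape)
```

Because every synthetic point is a convex combination of two existing minority points, the generated samples stay inside the region already occupied by the minority class.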

4.3 ELM - Extreme Learning Machine

The classification model’s structure is determined based on the number of hidden nodes and hidden layers. If the size of the ELM is too small, it will result in inefficient feature learning and underfitting problems. As the network size grows, so does the time complexity, resulting in poor model performance. The change in imbalance ratio after SMOTE and the introduction of new samples will decide the structure of ELM.

The N samples of the training set in dataset D are given as a set {(x_i, y_i)} for i = 1 to N, where x_i denotes the feature vector of the i^th sample and y_i is the class name of the i^th sample. In this dataset, the number of minority class samples is given as N⁺ and the number of majority class samples as N⁻. The total number of training samples is N = N⁺ + N⁻.

The initial imbalance ratio of the training set is $IR = \frac{N^{+}}{N^{-}}$. After SMOTE adds new samples, the N samples of the imbalanced set are raised to $N^{'}$ and the imbalance ratio changes to $IR^{'}$.

The change of imbalance ratio ΔIR is given as the difference between IR and $IR^{'}$. The mathematical model of ELM is expressed as $H β = T$ (6) where H = [h(x₁), h(x₂), …, h(x_N)]^T is the hidden layer output matrix for the training instances, β is the output layer’s weight matrix, and T = [t₁, t₂, …, t_N] is the target matrix.

The internal ELM is explained by the functional Equation (7), $β = H_{Q}^{T} {(\frac{I}{C} + H_{Q} H_{Q}^{T})}^{- 1} T_{Q}$ (7) where H_Q is the output matrix of the Q^th hidden layer, computed with random weights between the input nodes and hidden nodes and random thresholds of the hidden neurons, $H_{Q}^{T}$ is the transpose of the output matrix, I is the unit matrix, and C is the regularization coefficient. The activation function g(.) is the sigmoid function $g (u) = \frac{1}{(1 + e^{- au})}$; after it is applied, the final output weight matrix β is obtained.
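The closed-form solution of Equation (7) can be sketched for a single hidden layer as below. This is an illustration under assumptions: the names `elm_fit` and `elm_predict`, the toy two-cluster data, the 20 hidden nodes, and C = 100 are hypothetical, and the multi-layer stacking of Algorithm 3 is not reproduced.

```python
import numpy as np

def sigmoid(u, a=1.0):
    """Activation g(u) = 1 / (1 + exp(-a u))."""
    return 1.0 / (1.0 + np.exp(-a * u))

def elm_fit(X, T, n_hidden, C=1.0, seed=0):
    """Random hidden layer plus the regularized solution of Equation (7)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden thresholds
    H = sigmoid(X @ W + b)                       # hidden layer output matrix
    n = H.shape[0]
    # beta = H^T (I/C + H H^T)^(-1) T, solved without forming the inverse.
    beta = H.T @ np.linalg.solve(np.eye(n) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Hypothetical imbalanced toy data: 30 majority (0) and 10 minority (1) samples.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2.0, 0.5, size=(30, 2)),
               rng.normal(2.0, 0.5, size=(10, 2))])
T = np.concatenate([np.zeros(30), np.ones(10)])
W, b, beta = elm_fit(X, T, n_hidden=20, C=100.0)
pred = (elm_predict(X, W, b, beta) >= 0.5).astype(int)
print("training accuracy:", (pred == T).mean())
```

Only β is computed from the data; the hidden weights stay random, which is the source of the rapid training attributed to ELM.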

The number of hidden nodes P is obtained by $P = \left[(1 - \Delta IR) \times \frac{N}{M} + \Delta IR \times \frac{N^{'}}{M^{'}}\right]$ (8) where M is the count of the original feature set and $M^{'}$ is the count of the reduced feature set after noise removal in preprocessing. The hidden layer count is computed using $Q = \Delta IR \times M^{'}$ (9)

Algorithm 3: ELM

Input:

SMOTE sample

Output:

Target matrix T, output layer’s weight matrix β, output matrix of the hidden layer -H

Method:

1. for i = 1 to Q do

2. compute β

3. output matrix H_i = g(H_{i−1} · β)

4. calculate Target Matrix T_i = H_i β

5. end for

5 Performance Analysis

To evaluate the performance of the proposed AE_SMOTE_ELM method, it is compared with the plain SMOTE_ELM [11], W_ELM [38], FSVM_CIL [28], and PCA_GAN_ELM [36] methods. The experiments are executed on an Intel(R) Core(TM) i5 CPU at 3.20 GHz with 4.00 GB of installed RAM, a 64-bit operating system, and Python 3.0.

5.1 Datasets

The performance of the proposed method is studied on four real-world climate datasets obtained from the University of California, Irvine (UCI) repository and the Kaggle repository. The datasets are of various types, including temperature, wind speed, and rain. The data is normalized before analysis. The experiment performs binary classification on the imbalanced datasets. Each dataset is characterized by its number of dimensions, its instances split into positive/majority data and negative/minority data, and its imbalance ratio (IR).

The datasets are sourced from the Kaggle repository on the World Bank of Climate Change (WBCC), Local Climatological Data (LCD) from National Climatic Data Center (NCDC), NOAA government, Chennai Airport (CA) dataset, and Ozone Level Detection (OLD) data of UCI repository. The summary of datasets is shown in Table 1.

Table 1
Climate Datasets from four different data sources

Datasets	Features	Instances	Majority	Minority	IR	Data Source
World Bank Climate Change [16]	28	13512	11238	2274	0.202	Kaggle
Local Climatology Data [17]	86	773	553	220	0.39	NCDC
Chennai Weather [14]	19	1500	963	537	0.557	Airport
Ozone level Detection [9]	73	2536	2464	72	0.029	UCI
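The IR column of Table 1 follows from the ratio of minority to majority samples, $\frac{N^{+}}{N^{-}}$. The short sketch below recomputes it from the Majority and Minority columns; the helper name `imbalance_ratio` is hypothetical, and the small differences from the reported values come from rounding or truncation in the table.

```python
# Rows of Table 1: (dataset, majority count, minority count, reported IR)
table_1 = [
    ("World Bank Climate Change", 11238, 2274, 0.202),
    ("Local Climatology Data", 553, 220, 0.39),
    ("Chennai Weather", 963, 537, 0.557),
    ("Ozone level Detection", 2464, 72, 0.029),
]

def imbalance_ratio(n_minority, n_majority):
    """IR = N+ / N-, the minority count divided by the majority count."""
    return n_minority / n_majority

computed = {name: imbalance_ratio(mino, majo) for name, majo, mino, _ in table_1}
for name, _, _, reported in table_1:
    print(f"{name}: computed IR = {computed[name]:.3f}, reported IR = {reported}")
```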

5.2 Evaluation metrics

5.2.1 True Positive Rate(TPR)

TPR, also referred to as Sensitivity or recall, is the proportion of actual positives that are correctly classified as positives. $TPR = \frac{TP}{TP + FN}$ (10)

5.2.2 True Negative Rate (TNR)

TNR, also referred to as Specificity, is the proportion of actual negatives that are correctly recognized as negatives. $TNR = \frac{TN}{TN + FP}$ (11)

5.2.3 Precision

Precision represents the proportion of correctly classified true positives to the total number of samples predicted as positive. $Precision = \frac{TP}{TP + FP}$ (12)

5.2.4 Gmean

Classification performance under class imbalance is measured using Gmean. The True Positive Rate (TPR) and the True Negative Rate (TNR) are the inputs for computing Gmean. $Gmean = \sqrt{(TPR * TNR)}$ (13)

Gmean is a statistical measure of the accuracy of classes in classification. Gmean is one of the effective evaluation criteria that is independent of data distribution.

5.2.5 F1 score

F1 score is the metric used to evaluate the performance of the integrated method. It is the harmonic mean of Precision and recall, does not take true negatives into consideration, and is given by, $F1\ Score = \frac{2 TP}{2 TP + FN + FP}$ (14)

The Intersection Over Union (IoU) is given by, $IoU = \frac{TP}{(TP + FP + FN)}$ (15)

5.2.6 Accuracy

The accuracy of the classifier is computed using the expression, $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (16)

The accuracy value alone does not decide the effectiveness of the classification method.

For climate data analysis, a high accuracy value mainly reflects the prediction of normal weather conditions. The analysis should not miss the abrupt variations of weather conditions that lead to significant loss to lives, livestock, and the environment. The evaluation metrics of the proposed model improve the confidence level and aid in the development of the EWS.
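Equations (10) to (16) can be checked with a short worked example. The sketch below uses the Chennai Airport confusion matrix from Table 3 and treats the minority class (label 1) as the positive class, which is an assumption of this example; the function name `classification_metrics` is hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Evaluation metrics of Equations (10)-(16) from confusion matrix counts."""
    tpr = tp / (tp + fn)                                  # Equation (10)
    tnr = tn / (tn + fp)                                  # Equation (11)
    return {
        "TPR": tpr,
        "TNR": tnr,
        "Precision": tp / (tp + fp),                      # Equation (12)
        "Gmean": math.sqrt(tpr * tnr),                    # Equation (13)
        "F1": 2 * tp / (2 * tp + fn + fp),                # Equation (14)
        "IoU": tp / (tp + fp + fn),                       # Equation (15)
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),      # Equation (16)
    }

# Chennai Airport data in Table 3, with class 1 assumed to be the positive class.
metrics = classification_metrics(tp=360, tn=780, fp=40, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The resulting TPR and TNR agree with the 94.73% and 95.1% listed for this dataset in Table 3, and Gmean and Accuracy do not depend on which class is taken as positive.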

5.3 Analysis of AE_SMOTE

The AE undergoes training, validation, and testing phases, for which the data is split into 60%, 20%, and 20%, respectively. The encoding of a training sample is compressed into a latent representation at the bottleneck. The difference between the two losses gives the reconstruction error, which should be smaller than the threshold value. If this is not satisfied, an outlier exists, which leads to a class imbalance problem. The result in Table 2 clearly explains the necessity of data preprocessing by AE.
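The 60%, 20%, and 20% split can be sketched as follows. The helper name `split_60_20_20`, the random shuffling, and the placeholder arrays are assumptions; only the number of instances (773, as for the Local Climatology Data in Table 1) is taken from the paper.

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.6 * len(X))
    n_val = int(0.2 * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Placeholder data with 773 instances and 2 features.
X_all = np.arange(773 * 2, dtype=float).reshape(773, 2)
y_all = np.zeros(773, dtype=int)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_60_20_20(X_all, y_all)
print(len(train_X), len(val_X), len(test_X))
```

The test set receives the remainder after the integer-sized training and validation parts are taken, so the three parts always cover every instance exactly once.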

Table 2
Training and Testing Accuracy of ELM with AutoEncoder as the preprocessor

No. of. hidden neurons	Training Accuracy (%)	Testing Accuracy (%)
20	88.33	84.81
40	91.2	89.14
60	92.33	90.12
80	93.4	91.36
100	94.67	93.09

The training and testing accuracy values reflect the presence of impure data, which affects the performance of the ELM classifier. The number of hidden neurons also has an impact on the accuracy of ELM.

From the given imbalanced dataset, significant attributes are selected and extracted by the AE deep learning method. This step ensures that unwanted noise is removed and that the dimension of the original dataset is reduced to contain only significant features. SMOTE oversamples the dataset to compensate for the imbalance caused by the small number of minority samples. Based on the k nearest neighbor algorithm in SMOTE, more samples are generated near the k nearest neighbors of the minority samples. The samples are retained in the class based on RMSE. Table 2 clearly shows that the accuracy of ELM increases as the number of hidden neurons increases. For a value of 60 hidden neurons, the training accuracy is increased to 92.33% and testing accuracy to 90.12%. The dataset utilized in the classification is the Local Climatological Dataset.

Table 3 gives the confusion matrices of the different climate-related datasets, where 0 represents majority class samples and 1 represents minority class samples. The numerals represent the correctly classified and misclassified samples under normal and extreme weather conditions. From these, the True Positive Rate (TPR) and False Negative Rate (FNR) are calculated. According to the confusion matrix, the number of misclassified minority samples (20) is lower than the number of misclassified majority samples (30) for the Ozone Level Detection dataset. The proposed hybrid model of AE_SMOTE_ELM improves the minority sample classification rate and thereby improves the classification of imbalanced climate data. The minimal error samples are retained with the minority classes to reduce the imbalanced nature of the dataset.

Table 3

Confusion Matrix of 4 climate datasets

Confusion Matrix for World Bank Climate Change Data						Confusion Matrix for Local Climatological Data
Actual Class	Predicted Class			TPR	FNR	Actual Class	Predicted Class			TPR	FNR
		0	1					0	1
	0	8210	529	93.9%	6.1%		0	510	70	87.9%	12.06%
	1	376	3070	89.1%	10.9%		1	30	90	75%	25%
Confusion Matrix for Chennai Airport Data						Confusion Matrix for Ozone Level Detection Data
Actual Class	Predicted Class			TPR	FNR	Actual Class	Predicted Class			TPR	FNR
		0	1					0	1
	0	780	40	95.1%	4.87%		0	1865	30	98.41%	1.58%
	1	20	360	94.73%	5.26%		1	20	595	96.74%	3.25%

Figure 2 depicts the evaluation metrics of classification plotted for various algorithms. The metrics are Gmean, Sensitivity, Specificity, IoU, F1 Score, and Accuracy, calculated for four climate datasets taken from multiple sources. The proposed AE_SMOTE_ELM method scores higher than almost all the other hybrid algorithms, namely SMOTE_ELM, PCA_GAN_ELM, FSVM_CIL, and W_ELM. The metric values vary for each algorithm, based on the data distribution of the imbalanced dataset. The classification part of the hybrid method achieves better performance by exploiting the most suitable features from the balanced dataset produced by the SMOTE method. The chart depicts a quantitative comparison of the various methods. The performance of the classifier is improved by changing the number of hidden nodes.

Fig. 2

Comparison of evaluation metrics of AE_SMOTE_ELM hybrid model with other hybrid models for four climate datasets.

Increasing the number of hidden neurons has a significant effect on the accuracy on both training and testing samples. Peak performance is achieved at the optimum number of nodes in the ELM, which is estimated to be in the range of 20 to 30.

5.4 Receiver Operating Characteristics by AUC

Figure 3 depicts the classification performance of the algorithms as ROC curves and illustrates that the AE_SMOTE_ELM method outperforms the other models. The proposed model is tested on four different datasets with a class imbalance problem. The Area Under the ROC Curve (AUC) reaches its maximum value of 1 for the WBCC dataset, while the LCD dataset has the lowest AUC, 0.82, for the proposed hybrid model.

Fig. 3

ROC curves of AE_SMOTE_ELM model with other models on four climate datasets.

The AUC of the LCD dataset is smaller than that of the other datasets, and its ROC curve shows the effect of higher dimensionality.
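To make the AUC values concrete, the following minimal sketch computes the AUC as the fraction of (minority, majority) sample pairs that a classifier ranks correctly, which is equivalent to the area under the ROC curve. The function name `roc_auc` and the scores are hypothetical and are not taken from the experiments reported here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six samples (1 = minority class).
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.35, 0.40, 0.80, 0.65, 0.90]
print(roc_auc(y_true, y_score))
```

An AUC of 1 corresponds to every minority sample receiving a higher score than every majority sample. The pairwise loop is quadratic in the number of samples, so a rank-based implementation would be preferable for large datasets.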

5.5 Efficiency

The proposed algorithm is compared with the competing classifiers SMOTE_ELM, FSVM_CIL, and PCA_GAN_ELM using computation time, including the training time of the classification algorithms, as the evaluation metric on the different climate datasets. The proposed hybrid classifier is the fastest in terms of computational cost and shows the lowest RMSE. The data are preprocessed with the AE, which removes noise and redundant features and allows better classification.

The other classifiers produce different errors under the same experimental setup and on the same datasets. Although their performance is good, they have drawbacks: sample generation requires extra computational cost, which makes sample generation after the SMOTE process inefficient. The computational cost graph in Fig. 4 shows that the AE_SMOTE_ELM algorithm has the lowest cost on datasets of any size. The percentage RMSE graph in Fig. 5 shows that the proposed algorithm exhibits the lowest error on all datasets.

Fig. 4

Computational cost of the AE_SMOTE_ELM model compared with other hybrid models.

Fig. 5

RMSE in % for the various hybrid models on four climate datasets.

From the accuracy values in Fig. 2, the proposed classifier has the highest accuracy, and W_ELM has the lowest accuracy, 60% to 70%, on all the datasets. The RMSE shows that AE_SMOTE_ELM has the lowest error, 31%, which makes it a better classification algorithm than SMOTE_ELM and W_ELM. The observed error value indicates that the generated data samples deviate less from the original data.
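For reference, a percentage RMSE can be computed as in the sketch below. This is only an illustration under an assumption: this section does not spell out whether the error is taken over class labels or over continuous classifier outputs, so the function `rmse_percent` and the sample labels are hypothetical.

```python
import math


def rmse_percent(y_true, y_pred):
    """Root mean squared error between targets and predictions, scaled to percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return 100.0 * math.sqrt(mse)


# Hypothetical labels: 1 of the 10 predictions is wrong.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(f"{rmse_percent(y_true, y_pred):.1f}%")
```

With hard 0/1 labels the RMSE reduces to the square root of the misclassification rate, so it is always at least as large as the error rate itself.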

After oversampling the imbalanced dataset, the classification shows improvement in F1 measure, Gmean, and IoU values. The values from Fig. 2 exhibit the stability of the proposed algorithm.
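The metrics named above can be derived from the four counts of a binary confusion matrix. The sketch below uses the usual textbook definitions and treats IoU as the Jaccard index of the positive class; both are assumptions, and the definitions given earlier in the paper take precedence where they differ. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on the positive (minority) class
    specificity = tn / (tn + fp)            # recall on the negative (majority) class
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gmean": math.sqrt(sensitivity * specificity),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "iou": tp / (tp + fp + fn),         # Jaccard index (assumed IoU definition)
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Ozone Level Detection counts from Table 3, with class 1 (minority) as positive.
ozone = binary_metrics(tp=595, fn=20, fp=30, tn=1865)
for name, value in ozone.items():
    print(f"{name}: {value:.4f}")
```

Gmean is the geometric mean of sensitivity and specificity, so it stays low whenever either class is poorly recognised, which is why it is preferred over plain accuracy for imbalanced data.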

6 Comparison with deep learning methods

The proposed hybrid method is compared with state-of-the-art deep learning methods such as Dual Discriminator Generative Adversarial Nets (D2GAN) [35], Restricted Boltzmann Machine (RBM) [32], Minority Oversampling Generative Adversarial Network (MOGAN) [34], Residual Joint Adaptation Adversarial Network (RJAAN) [20], and CBN Variational Autoencoder (CBN-VAE) [24].

The Australian Weather Rain dataset [15] is used for the experiment with the state-of-the-art methods. The dataset has 23 features and 145460 instances. The binary target RainToday, which labels each day as rainy or clear, is imbalanced: its class distribution is 142199:3261, a ratio of about 44:1.
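The stated ratio can be checked directly from the class counts. The sketch below is a minimal illustration; the helper `imbalance_ratio` and the 0/1 label values are hypothetical placeholders, and only the two counts come from the text.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority count, minority count, majority/minority ratio)."""
    counts = Counter(labels).most_common()
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    majority, minority = counts[0][1], counts[-1][1]
    return majority, minority, majority / minority


# Class counts as reported for RainToday; the label values 0/1 are placeholders.
labels = [0] * 142199 + [1] * 3261
majority, minority, ratio = imbalance_ratio(labels)
print(f"{len(labels)} instances, {majority}:{minority}, about {ratio:.1f}:1")
```

The two counts sum to the 145460 instances of the dataset, and the quotient rounds to the reported 44:1.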

The CO₂ emission dataset [13] is also tested with the proposed method and compared with the state-of-the-art deep learning methods. The CO₂ emission dataset has 55 dimensions and 20931 instances.

The classification accuracy obtained by AE_SMOTE_ELM on the Australian Weather Rain dataset is 90.34%, while D2GAN, MOGAN, RJAAN, RBM, and CBN_VAE obtain 83.1%, 79.56%, 78.05%, 73.7%, and 80.4%, respectively. The high accuracy of the AE_SMOTE_ELM method helps to predict rainfall in Australia. On the CO₂ emission dataset, the classification accuracy of AE_SMOTE_ELM is 91.603%, while D2GAN, MOGAN, RJAAN, RBM, and CBN_VAE obtain 85.32%, 78.91%, 77.26%, 75.83%, and 82.11%, respectively. Table 4 shows that the proposed method also outperforms D2GAN, MOGAN, RJAAN, RBM, and CBN_VAE on Sensitivity, Specificity, Precision, Recognition Rate (RR), and Misclassification Rate (MR). The high accuracy of the AE_SMOTE_ELM method helps to identify the cause of CO₂ emission.

Table 4
Classification results of AE_SMOTE_ELM and state-of-the-art deep learning methods on two climate datasets

Dataset	Method	Sensitivity	Specificity	Precision (%)	RR (%)	MR (%)
AustralianRainDataset [15]	AE_SMOTE_ELM	0.79	0.53	92.23	90.29	9.71
	D2GAN	0.68	0.43	83.65	81.25	18.75
	MOGAN	0.61	0.41	72.43	71.22	28.78
	RJAAN	0.65	0.47	78.11	76.98	23.02
	RBM	0.58	0.44	71.08	70.23	29.77
	CBN_VAE	0.62	0.43	75.34	73.06	26.94
CO₂ Emission Dataset [13]	AE_SMOTE_ELM	0.922	0.871	90.45	95.73	4.27
	D2GAN	0.871	0.796	84.7	84.61	15.39
	MOGAN	0.836	0.727	77.64	78.45	21.55
	RJAAN	0.838	0.819	82.23	81.25	18.75
	RBM	0.809	0.822	72.69	75.66	24.34
	CBN_VAE	0.81	0.803	77.03	75.85	24.15

The statistical significance of the proposed method is assessed by a paired t-test [8, 37]. The stats.ttest_ind() method from the SciPy library for Python was used to perform the t-test, with the significance level set to 0.05. The paired t-test of AE_SMOTE_ELM against the D2GAN, MOGAN, RJAAN, and RBM learning algorithms reveals a significant difference in error rates.

When compared with the state-of-the-art methods, the proposed approach achieves a p-value of less than 0.05 on the CO₂ emission dataset. Thus, the improvement in classification accuracy of the proposed method is supported by the statistical t-test. Fig. 6 illustrates the results of the statistical assessment for various settings of the p-value.

Fig. 6

Results of statistical assessment against state-of-the-art deep learning model.
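As a rough illustration of the paired test, the sketch below computes the paired t statistic with the Python standard library only and compares it with the two-sided critical value for a 0.05 significance level. The per-fold error rates and the function name `paired_t_statistic` are hypothetical and are not results from this study. In SciPy, the paired variant is `scipy.stats.ttest_rel`, whereas `scipy.stats.ttest_ind` treats the two samples as independent.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-fold error rates (%) of two classifiers on the same 5 folds.
err_proposed = [8.1, 9.0, 8.6, 9.4, 8.9]
err_baseline = [15.2, 14.8, 16.1, 15.5, 14.9]

t, df = paired_t_statistic(err_proposed, err_baseline)
T_CRIT = 2.776   # two-sided critical value of Student's t for df = 4, alpha = 0.05
print(f"t = {t:.2f}, df = {df}, significant at 0.05: {abs(t) > T_CRIT}")
```

A paired test is appropriate when both classifiers are evaluated on the same folds or samples, because it works on the per-pair differences rather than on the two groups separately.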

7 Conclusion and future work

A hybrid AE_SMOTE_ELM model is designed for imbalanced climate data classification. The traditional combination of AutoEncoder, SMOTE and ELM gives the mathematical expressions for the hidden layers and nodes based on the imbalance ratio and sample distribution. The hybrid model demonstrates its effectiveness on four different imbalanced climate datasets. Its performance is compared with the SMOTE_ELM, FSVM_CIL, W_ELM, and PCA_GAN_ELM machine learning methodologies on vital statistical parameters. AE_SMOTE_ELM shows the highest accuracy of 92% and above for datasets of different sizes. The hybrid model outperforms the state-of-the-art deep learning methods D2GAN, MOGAN, RJAAN, RBM, and CBN_VAE and is robust to various imbalance ratios. This performance improvement helps detect anomalies in climate data and build the Early Warning System. Climate change analysis with the proposed hybrid algorithm provides a link between climate change and the development of a region. Analysis of climate change datasets with deep learning models can help improve socio-economic welfare, and imbalance handling helps build improved forecasting for the Early Warning System.

The main limitation of the hybrid method arises with the higher levels of imbalance that occur in climate datasets in practice; using deep learning to solve class imbalance is still relatively new and understudied.

In the future, the AE_SMOTE_ELM model can be enhanced by introducing a dimensionality reduction technique in the initial stage of preprocessing. The concept drift problem can be handled by the same model by cascading the ADWIN technique after SMOTE. The limitations of applying basic deep learning methods to class-imbalanced data will become more complex.

This finding supports the call for more research that compares several deep learning approaches across a range of class imbalance levels and problem difficulties to aid in the data analysis of climate change.

The proposed model will be expanded in the future to cover multi-class imbalanced data classification and the scalability of AE_SMOTE_ELM in the analysis of huge imbalanced datasets.

References

1. Abhishek, Singh M.P., Ghosh and Anand, Weather forecasting model using artificial neural network, Procedia Technology 4 (2012), 311–318.

2. Agostini, Exploration and prediction of fluid dynamical systems using auto-encoder technology, Physics of Fluids 32(6) (2020), 067103.

3. Baboo S.S. and Shereef I.K., An efficient weather forecasting system using artificial neural network, International Journal of Environmental Science and Development 1(4) (2010), 321.

4. Branco, Torgo and Ribeiro R.P., A survey of predictive modelling under imbalanced distributions, CoRR, arXiv:1505.01658 (2015).

5. Bury T.M., Sujith R.I., Pavithran, Scheffer, Lenton T.M., Anand and Bauch C.T., Deep learning for early warning signals of tipping points, Proceedings of the National Academy of Sciences of the United States of America 118(39) (2021), e2106140118. https://doi.org/10.1073/pnas.2106140118

6. Carleton, Greenstone, Hsiang, Hultgren, Jina, Kopp and Rode, (2017). Indian climate early warning system Reference number: E-89342-INC-1.

7. Chantry, Christensen, Dueben and Palmer, Opportunities and challenges for machine learning in weather and climate modelling: hard, medium and soft AI, Philosophical Transactions of the Royal Society A 379(2194) (2021), 20200083.

8. Chen, Brissette F.P., Poulin and Leconte, Overall uncertainty study of the hydrological impacts of climate change for a Canadian watershed, Water Resources Research 47(12), 2011.

9. Dua and Graff, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

10. Gad, Hosahalli, Manjunatha B.R., et al., A robust deep learning model for missing value imputation in a big NCDC dataset, Iran J Comput Sci (2020). https://doi.org/10.1007/s42044-020-00065-z

11. Gong and Gu, A novel SMOTE-based classification approach to online data imbalance problem, Mathematical Problems in Engineering, 2016.

12. Hassani, Huang and Silva, Big Data and climate change, Big Data and Cognitive Computing 3(1) (2019), 12.

13. Ritchie and Roser (2020), “CO2 and Greenhouse Gas Emissions”. Published online at OurWorldInData.org. Retrieved from: https://ourworldindata.org/co2-and-othergreenhouse-gas-emissions [Online Resource]

14. Climate and Average Weather Year Round at Chennai International Airport India. Published online at www.weatherspark.com. Retrieved from: https://weatherspark.com/y/149014/Average-Weather-at-Chennai-International-Airport-India-Year-Round#Sections-Sources

15. Rain in Australia: Predict next-day rain in Australia. Published online at www.kaggle.com. Retrieved from: https://www.kaggle.com/jsphyg/weather-dataset-rattlepackage. Version 2 [2020].

16. World Bank Climate Change Data, From World Bank Open Data. Published online at https://www.kaggle.com. Retrieved from: https://www.kaggle.com/theworldbank/world-bank-climate-change-data. Version 38 [2018].

17. National Centers for Environmental Information. Published online at https://www.ncdc.noaa.gov. Retrieved from: https://www.ncdc.noaa.gov/cdo-web/datasets

18. IPCC, 2021: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC AR6 WGI Full Report.pdf, [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press. In Press.

19. Japkowicz and Stephen, The class imbalance problem: A systematic study, Intell Data Analysis 6(5) (2002), 429–449.

20. Jiao, Zhao, Lin and Liang, Residual joint adaptation adversarial network for intelligent transfer fault diagnosis, Mechanical Systems and Signal Processing 145 (2020), 106962.

21. Johnson J.M. and Khoshgoftaar T.M., Survey on deep learning with class imbalance, J Big Data 6(27) (2019). https://doi.org/10.1186/s40537-019-0192-5

22. Krawczyk, Learning from imbalanced data: open challenges and future directions, Prog Artif Intell 5 (2016), 221–232. https://doi.org/10.1007/s13748-016-0094-0

23. Lee, Jun C.H. and Lee J.S., Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification, Information Sciences 381 (2017), 92–103.

24. Liu, Chen, Yan and Wang, CBN-VAE: A data compression model with efficient convolutional structure for wireless sensor networks, Sensors 19(16) (2019), 3445.

25. Malvoni, De Giorgi M.G. and Congedo P.M., Forecasting of PV Power Generation using weather input data preprocessing techniques, Energy Procedia 126 (2017), 651–658.

26. Maurya C.K., Toshniwal and Venkoparao G.V., Online sparse class imbalance learning on big data, Neurocomputing 216 (2016), 250–260.

27. Pulwarty R.S. and Sivakumar M.V., Information systems in a changing climate: Early warnings and drought risk management, Weather and Climate Extremes 3 (2014), 14–21.

28. Batuwita and Palade, FSVM-CIL: Fuzzy Support Vector Machines for Class Imbalance Learning, IEEE Transactions on Fuzzy Systems 18(3) (2010), 558–571. doi: 10.1109/TFUZZ.2010.2042721.

29. Saba, Rehman and AlGhamdi J.S., Weather forecasting based on hybrid neural model, Appl Water Sci 7 (2017), 3869–3874. https://doi.org/10.1007/s13201-017-0538-0

30. Sivaneasan, Yu C.Y. and Goh K.P., Solar forecasting using ANN with fuzzy logic pre-processing, Energy Procedia 143 (2017), 727–732.

31. Khoshgoftaar T.M., Van Hulse and Napolitano, Comparing boosting and bagging techniques with noisy and imbalanced data, IEEE Trans Syst, Man, Cybern. A, Syst, Humans 41(3) (2011), 552–568.

32. Wang, Gao, Wan, Li and Hu, A novel restricted Boltzmann machine training algorithm with fast Gibbs sampling policy, Mathematical Problems in Engineering 2020.

33. , Huang, Fan, Ma, Zhou and Zeng, Hybrid extreme learning machine with meta-heuristic algorithms for monthly pan evaporation prediction, Computers and Electronics in Agriculture 168 (2020), 105115.

34. Zareapoor, Shamsolmoali and Yang, Oversampling adversarial network for class-imbalanced fault diagnosis, Mechanical Systems and Signal Processing 149 (2021), 107175.

35. Zhai, Qi and Zhang, Binary Imbalanced Data Classification Based on Modified D2GAN Oversampling and Classifier Fusion, IEEE Access 8 (2020), 169456–169469.

36. Zhang, Yang and Jiang, Imbalanced biomedical data classification using self-adaptive multilayer ELM combined with dynamic GAN, BioMed Eng OnLine 17 (2018), 181. https://doi.org/10.1186/s12938-018-0604-3

37. Zimmerman D.W., Teacher’s corner: A note on interpretation of the paired-samples T test, Journal of Educational and Behavioral Statistics 22(3) (1997), 349–360.

38. Zong, Huang G.B. and Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (2013), 229–242.


Table 2
Training and Testing Accuracy of ELM with AutoEncoder as the preprocessor

No. of hidden neurons	Training Accuracy (%)	Testing Accuracy (%)
20	88.33	84.81
40	91.2	89.14
60	92.33	90.12
80	93.4	91.36
100	94.67	93.09