Convolutional neural networks applied to dissolved gas analysis for power transformers condition monitoring

Abstract

In this contribution a methodology to diagnose transformer faults based on Dissolved Gas Analysis (DGA) by using a convolutional neural network (CNN) is proposed. The algorithm to transform the gas contents (resulting from the DGA analysis) into feature maps is introduced, and the resulting feature maps are the input of the CNN. In order to take into account the fact that the data set is imbalanced, the improved Synthetic Minority Over-Sampling Technique (SMOTE) is combined with the data cleaning technique to protect the CNN from training bias. The effect of the CNN architecture on the classification performance is also investigated to determine the optimal CNN parameters. All the above mentioned possibilities are tested and their performance investigated; in addition, a final test on the IEC TC 10 transformer fault database validates the accuracy and the generalization potential of the proposed methodology.

Keywords

Convolutional neural network; deep learning; DGA; fault diagnosis; SMOTE; transformer

1. Introduction

Power transformers are a key element of the electric energy generation and transmission system, and an outage of a power transformer may have severe consequences from many points of view; for this reason, condition monitoring and predictive maintenance are fundamental. In addition, the detection of incipient faults inside transformers is a key aspect of the economic optimization of the maintenance plan: incipient faults may appear due to electrical, thermal, or mechanical damage, and transformers operating in an abnormal state may deteriorate further, turning incipient faults into catastrophic failures. Consequently, the detection and elimination of incipient faults inside the transformer is of great importance. However, such faults are hard for maintenance personnel to detect by visual inspection. Thanks to the development of online monitoring technology, predictive diagnostics can be performed: Dissolved Gas Analysis (DGA) is the most commonly used online monitoring technology owing to its convenience and economy.

DGA measures the content of several gases, such as H₂, CH₄, C₂H₆, C₂H₄, and C₂H₂, in oil-insulated transformers. The mapping between fault types and the detected gases is not easy to establish, since these fault gases can also be generated during the regular operation of transformers. Traditional DGA methods, such as the Duval triangle method [1] and the IEC three-ratio method [2], adopt the ratios between two fault gases and distinguish different fault types by setting thresholds. However, investigations by other researchers confirmed that the diagnosis accuracy of these traditional methods sometimes cannot meet the requirements [3].

With the rapid development of machine learning, an increasing number of researchers have devoted their efforts to diagnosing transformer faults with machine learning based techniques, such as the multilayer perceptron (MLP) [4,5], the support vector machine (SVM) [6,7], and ensemble learning [8,9]. Unlike traditional transformer diagnosis methods, which are based on predefined rules, machine learning methods rely on training a model with fault samples; this approach, however, is not free of problems. The two most common problems are the imbalance of the data set and feature selection. The sample sizes of different categories are usually uneven, and an unbalanced training data set may bias the learning model, leading to high accuracy for majorities (categories having more samples than the average) but low accuracy for minorities (categories having fewer samples than the average) [10]. As for feature selection, it has been confirmed that using gas ratios as model inputs is more efficient than using the original gas contents [11,12]. However, the optimal gas ratio combination is still unclear.

Some methods have been proposed by researchers to solve the unbalanced classification problems. The simplest method to address the imbalance in the data set is oversampling the minorities and/or undersampling the majorities. However, the oversampling of the minorities may cause overfitting of the classifier, and the undersampling of the majorities may lead to the loss of some important information hidden in the samples of the majorities. To overcome the defects of both oversampling and undersampling, a new method called Synthetic Minority Over-sampling Technique (SMOTE) was developed and has been proven to be efficient in dealing with unbalanced data sets [13]. However, it is also confirmed that the SMOTE may increase the overlapping between the minorities and the majorities, especially when label noise in the data set exists [14]. Consequently, in this paper, the SMOTE is combined with a data cleaning technique to solve the imbalance issue in the transformer fault data set. In addition, in order to decrease the probability of the clustering of the synthetic samples, the improved SMOTE is employed, which is characterised by space interpolation instead of the original linear interpolation.

There are two strategies for feature selection, filter algorithms and wrapper algorithms [15], and both have been applied to transformer fault diagnosis. Wrapper algorithms are adopted to select optimal features in [16–18], while a hybrid feature selection approach combining filter algorithms and wrapper algorithms is proposed in [19]. However, both filter and wrapper algorithms have their respective deficiencies: filter algorithms neglect the mutual effects among different features, leading to a sub-optimal feature combination, while wrapper algorithms are difficult to apply to large feature sets since the computational expense scales exponentially with the size of the feature set.

An alternative solution to feature selection is deep learning, especially the convolutional neural network (CNN), whose most outstanding advantage is its automatic feature extraction [20]. The convolutional layers and pooling layers in a CNN automatically compress the feature dimension by dropping irrelevant and redundant features, so that only effective features are sent to the fully connected layers. Some research on the application of CNNs to transformer fault diagnosis has been done [21–25]. In particular, a CNN is combined with transfer learning to overcome the difficulty of collecting high-quality transformer fault samples [21]. Similarly, federated learning is introduced in order to train a CNN with a small training data set [22]. A transformer diagnosis method based on the combination of an improved CNN and adaptive synthetic oversampling (ADASYN) is proposed in [23], while the effect of different noise levels is analysed in [24]. A thorough comparison between CNNs and other common classifiers is carried out in [25], showing the effectiveness of CNN based approaches.

Based on the aforementioned techniques, a novel methodology to diagnose transformer faults from the unbalanced DGA data set is proposed in this paper. First, the improved SMOTE combined with the data cleaning technique is used to solve the imbalance of the DGA data set avoiding the bias of the classifier. Second, a CNN is trained by the rebalanced transformer fault data set. The numerical results show good accuracy of the proposed methodology.

The main contributions of this paper are summarized as follows:

A strategy to solve the imbalance of the data set by combining the improved SMOTE and the data cleaning technique.

A novel framework to generate feature maps for CNN from original DGA data.

A methodology for condition monitoring of power transformers characterized by high accuracy based on the above two improvements.

The remainder of this paper is organized as follows: Section 2 describes the procedure of using the improved SMOTE and the data cleaning technique to solve the imbalance of the training data set. Section 3 explains the steps of constructing a CNN-based transformer fault diagnosis model. Section 4 presents the application of the proposed method and Section 5 concludes the paper.

2. Data pre-processing for unbalanced data set

Imbalance of the data set is very common in classification problems. Table 1 shows one of the two data sets used in this contribution; it was obtained from data collected from a power supply company in China and from the available literature [26–31]. The table shows the number of available samples for each fault category, namely Partial Discharge (PD), Low–Middle temperature Overheating (L-M-O), High temperature Overheating (H-O), Low energy Discharge (L-D) and High energy Discharge (H-D); each sample is composed of 5 features, each being the concentration of one of the 5 gases mentioned before. The data set is clearly unbalanced, which biases the classifier towards assigning an unknown sample to one of the majority categories; as a result, the classification accuracy of the minorities will probably be much lower than the average classification accuracy. Consequently, data pre-processing is necessary before using the data set to train the classifier.

The data pre-processing in this paper includes data cleaning and the SMOTE, the latter used for the rebalance of the data set. However, it is known that the SMOTE may increase the overlapping between the majorities and the minorities, especially when label noise exists [14]. For this reason a prior data cleaning technique is adopted to filter the label noise so that the synthetic samples generated by SMOTE will not be generated around noisy samples.

Table 1
The sample size of each category in the used data set

Category	The number of samples
PD	26
L-M-O	109
H-O	391
L-D	120
H-D	203
Total	849

PD: partial discharge; L-M-O: low-middle temperature overheating; H-O: high temperature overheating; L-D: low energy discharge; H-D: high energy discharge.

2.1. Data cleaning

The data cleaning procedure is implemented in order to filter out abnormal samples. The rationale behind a data cleaning procedure lies in the fact that samples belonging to the same category (hence having the same label) are usually clustered together in the feature space. If a sample is far from its corresponding cluster, the probability that this sample is abnormal is relatively large. In classification problems, there are two common sources of abnormal samples: wrongly labeled samples, and measurement errors (leading to one or more wrongly measured features).

It is assumed that samples with the same label follow a Gaussian distribution in each dimension of the feature space. In this respect, the abnormal samples can be discriminated by the 3σ rule. According to the probability density function of the Gaussian distribution, the following probability holds: $\Pr(\mu - 3\sigma \leq x \leq \mu + 3\sigma) \approx 0.997$ (1) where μ and σ are the mean and the standard deviation, respectively. Equation (1) means that the probability that a sample is in the range [μ − 3σ, μ + 3σ] is 99.7%, so the probability that a regular sample falls outside this range is very low. A sample found outside the range [μ − 3σ, μ + 3σ] is therefore much more likely to be a noisy sample than a regular one. Consequently, the abnormal samples can be discriminated using this rule.

For a multi-dimensional feature space, as in this application, the aforementioned 3σ rule is applied dimension by dimension: if any one feature does not satisfy the rule, the corresponding sample is considered to be abnormal. With this data cleaning technique, wrongly labelled samples are (with high probability) purged, while samples affected by measurement uncertainties are often retained. This is not a problem, since it has been demonstrated in [32] that multiple classifier systems perform better with noisy data, due to the increased generalization capability of the resulting classifier.
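The dimension-by-dimension 3σ rule described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name `three_sigma_clean` and the use of the population standard deviation (`statistics.pstdev`) are assumptions, since the text does not specify how σ is estimated.

```python
from statistics import mean, pstdev


def three_sigma_clean(samples, labels):
    """Keep only samples whose every feature lies within
    [mu - 3*sigma, mu + 3*sigma] of its own class (hypothetical helper)."""
    kept = []
    for label in set(labels):
        # Collect all samples sharing this label.
        group = [s for s, l in zip(samples, labels) if l == label]
        n_features = len(group[0])
        # Per-class, per-dimension mean and standard deviation.
        mu = [mean(s[j] for s in group) for j in range(n_features)]
        sigma = [pstdev([s[j] for s in group]) for j in range(n_features)]
        for s in group:
            # A single violating feature marks the whole sample as abnormal.
            if all(abs(s[j] - mu[j]) <= 3 * sigma[j] for j in range(n_features)):
                kept.append((s, label))
    return kept
```

Note that with very small classes (such as PD with 26 samples) a single extreme value inflates σ itself, so the rule removes only pronounced outliers.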

At the end of the cleaning procedure performed on the data set described in Table 1, the number of samples is reduced to 23 (PD), 105 (L-M-O), 382 (H-O), 113 (L-D), 194 (H-D).

2.2. Synthetic minority over-sampling technique (SMOTE)

If the training data are unbalanced, the trained classifier will be biased and the classification accuracy of the majorities will be higher than that of the minorities. Taking an abnormal product identification problem as an example, if only about 2% of the products are abnormal and classification accuracy is used as the evaluation criterion, the classifier tends to classify products as normal in order to achieve a higher classification accuracy. Indeed, even if the classifier classifies all samples as normal, the classification accuracy is still as high as 98%, but such a classifier cannot identify any abnormal sample.
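The 98% figure in this example can be checked with a few lines of Python. The batch of 1000 products and the helper name `accuracy_and_recall` are assumptions made only for illustration.

```python
def accuracy_and_recall(y_true, y_pred, positive):
    """Overall accuracy and the recall of one class (hypothetical helper)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    # Predictions for the samples that truly belong to the class of interest.
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    recall = sum(p == positive for p in preds_for_positive) / len(preds_for_positive)
    return accuracy, recall


# Hypothetical batch: 1000 products, 2% of them abnormal.
y_true = ["abnormal"] * 20 + ["normal"] * 980
# Trivial classifier that labels every product as normal.
y_pred = ["normal"] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred, "abnormal")
print(f"accuracy = {accuracy:.2f}, recall on abnormal = {recall:.2f}")
```

The accuracy is high while the recall on the abnormal class is zero, which is why accuracy alone is a poor criterion for unbalanced data.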

In the field of imbalanced classification, the SMOTE has been proven to be efficient in reducing the bias of classifiers [10]. The SMOTE is similar to the oversampling of minorities; however, plain oversampling balances the training data by simply duplicating the available samples, while the SMOTE generates synthetic samples around the existing ones. It has been demonstrated that oversampling of the minorities easily leads to overfitting of the classifiers [13], while the adoption of the SMOTE yields classifiers less prone to overfitting.

A modified version of the technique, called improved SMOTE, is proposed in [13]; in this improved version, a new interpolation method, called space interpolation, is used. The details of the improved SMOTE are described in [33], where it is shown that the synthetic samples generated by the improved SMOTE have a higher degree of dispersion, while the ones generated by the original SMOTE are arranged approximately along lines. In summary, the synthetic samples generated by the improved SMOTE are evenly distributed in the feature space around the cluster center of each category, reducing the overlapping of synthetic samples, which is beneficial to the generalization of classifiers trained with this data set.
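To make the line-like arrangement of the original SMOTE concrete, the sketch below implements its linear interpolation between a minority sample and one of its nearest neighbours. It illustrates the original technique only; the space interpolation of the improved SMOTE is described in [33] and is not reproduced here. The function name `smote_linear` and the default `k=5` are assumptions.

```python
import math
import random


def smote_linear(minority, n_synthetic, k=5, seed=0):
    """Original SMOTE: each synthetic sample lies on the segment joining a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of the base sample within the minority class.
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing samples, the generated points trace lines in the feature space, which is the behaviour the improved SMOTE is reported to avoid.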

3. Formulation of the fault diagnosis model

A CNN is a special kind of classifier used to classify images and perform object recognition [20,34]. One of the most valuable advantages of a CNN is its automatic feature extraction ability. Usually, there are many irrelevant features in the given graphs, but a CNN can filter out irrelevant features and compress the feature dimension. As explained before, each sample is composed of five gas contents; in order to transform this data set into a form that can be used by a CNN, the authors developed a procedure to obtain 2D grayscale graphs from the raw data. This section is organized as follows: first, the framework to transform DGA data into graphs is introduced; second, the structure of the CNN used in this study is illustrated.

3.1. Transformation from DGA to feature maps

The original features are the contents of five fault gases: H₂, CH₄, C₂H₆, C₂H₄, and C₂H₂, so each sample of Table 1 is composed of 5 features. In order to obtain a data set suitable for a CNN, the authors define a set of new data obtained by adding up i gases, with i = 1–4; this defines the coefficients $C_{5}^{i}$ (i = 1–4), in which, for instance, $C_{5}^{2}$ is the sum of any two gas features. Afterwards, all possible ratios between the $C_{5}^{i}$ coefficients are considered, as shown in Table 2. A few additional manipulations are needed. First, to avoid a meaningless mathematical operation (division by zero or the logarithm of zero), the elements with zero value in the gas content vector, which indicate that a gas was not detected in the transformer oil, are replaced with a small value. Second, the gas ratios are scaled by using the log function, and the diagonal elements of the matrix are assigned zero since the numerator and denominator coincide. Third, before assembling the grayscale graph, the element values are scaled into the range [0, 255], which is the pixel value range in the usual coding of grayscale images. The details of transforming DGA data into grayscale graphs are shown in Algorithm 1.

Algorithm 1: Pseudocode to transform DGA data into grayscale graphs

A feature map generated by Algorithm 1 is a grayscale graph that contains information about all possible gas ratios obtained from the contents of five fault gases, including H₂, CH₄, C₂H₆, C₂H₄, and C₂H₂. To generate a feature map, the gas content vector X, which has a length of 30, should be prepared first. This is done by adding up different combinations of the five fault gases, and then considering all possible ratios between them. The grayscale graph is then generated by scaling the matrix of gas ratios into the range of [0, 255], where the maximum and minimum values of the matrix are determined, and the matrix is normalized accordingly. The resulting matrix is then used to draw the grayscale graph. The feature map, which is a representation of the normalized grayscale graph, is then used as the input for the CNN. In summary, the feature map is a visual representation of the gas ratios that can be used as input to the CNN for transformer fault diagnosis.
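The steps described for Algorithm 1 can be sketched in Python as follows. This is a reading of the textual description rather than the authors' code: the function name `dga_to_feature_map`, the replacement value `eps`, and the base-10 logarithm are assumptions (the base has no effect after min-max scaling). Rows correspond to denominators and columns to numerators, following Table 2.

```python
import math
from itertools import combinations


def dga_to_feature_map(gases, eps=1e-6):
    """gases: five concentrations (H2, CH4, C2H6, C2H4, C2H2).
    Returns a 30 x 30 nested list of integers in [0, 255]."""
    # Replace undetected (zero) gases with a small value (eps is an assumption).
    gases = [g if g > 0 else eps for g in gases]
    # Gas content vector X: sums over all combinations of 1 to 4 gases,
    # giving 5 + 10 + 10 + 5 = 30 entries.
    x = [sum(c) for i in range(1, 5) for c in combinations(gases, i)]
    n = len(x)
    # Log-scaled ratios; the diagonal is zero because numerator = denominator.
    ratios = [
        [0.0 if r == c else math.log10(x[c] / x[r]) for c in range(n)]
        for r in range(n)
    ]
    # Min-max scaling of the whole matrix into the grayscale range [0, 255].
    lo = min(min(row) for row in ratios)
    hi = max(max(row) for row in ratios)
    span = (hi - lo) or 1.0
    return [[round(255 * (v - lo) / span) for v in row] for row in ratios]
```

Since log(a/b) = −log(b/a), the ratio matrix is antisymmetric, so the upper and lower triangles of the resulting image carry mirrored information.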

Table 2
All possible gas ratios

	Numerator
Denominator	$C_{5}^{1}$	$C_{5}^{2}$	$C_{5}^{3}$	$C_{5}^{4}$
$C_{5}^{1}$	$C_{5}^{1}/C_{5}^{1}$	$C_{5}^{2}/C_{5}^{1}$	$C_{5}^{3}/C_{5}^{1}$	$C_{5}^{4}/C_{5}^{1}$
$C_{5}^{2}$	$C_{5}^{1}/C_{5}^{2}$	$C_{5}^{2}/C_{5}^{2}$	$C_{5}^{3}/C_{5}^{2}$	$C_{5}^{4}/C_{5}^{2}$
$C_{5}^{3}$	$C_{5}^{1}/C_{5}^{3}$	$C_{5}^{2}/C_{5}^{3}$	$C_{5}^{3}/C_{5}^{3}$	$C_{5}^{4}/C_{5}^{3}$
$C_{5}^{4}$	$C_{5}^{1}/C_{5}^{4}$	$C_{5}^{2}/C_{5}^{4}$	$C_{5}^{3}/C_{5}^{4}$	$C_{5}^{4}/C_{5}^{4}$

$C_{5}^{1}$ : any feature gas; $C_{5}^{2}$ : the sum of any two feature gases; $C_{5}^{3}$ : the sum of any three feature gases; $C_{5}^{4}$ : the sum of any four feature gases; $C_{5}^{5}$ : the sum of all five feature gases.

From Algorithm 1, it is seen that the obtained grayscale graph has 30 × 30 pixels; some examples are shown in Fig. 1. The obtained grayscale images, also called feature maps, are the input of the CNN.

Fig. 1.

Some representative examples of grayscale graphs. (a) PD. (b) L-M-O. (c) H-O. (d) L-D. (e) H-D.

Fig. 2.

The architecture of the CNN used to build the diagnosis model.

3.2. CNN-based classifier

Usually, a CNN consists of convolutional layers, pooling layers and fully connected layers. The architecture of the CNN used in the proposed methodology is shown in Fig. 2. The 2D feature maps will be condensed into a 1D feature vector after the operation of convolutional layers and max pooling layers, and the feature vector will be the input to fully connected layers to obtain the final classification results.

Convolutional layer: The convolutional layer is the key layer for feature extraction and performs the convolution operation. The architecture of a convolutional layer is uniquely determined by its convolutional kernels. A convolutional kernel is described by at least three parameters: depth, width and height. The depth is the same as the number of input feature maps, and the width and height are usually 3 × 3 or 5 × 5. The latter is chosen in the proposed model, with no padding and a stride equal to 1. The rectified linear unit function (ReLU) is used as the activation function of the convolutional layers, and its expression is as follows: $\mathrm{ReLU}(x)=\max(0,x)$ (2)

The output of a convolutional layer is a condensed feature map, obtained by the convolution operation using convolutional filters. Simply speaking, each output pixel can be viewed as the weighted sum of the pixels in a specific area of the original feature map. Consequently, in terms of physical meaning, the features obtained by convolving the input data can be thought of as weighted sums of fault gas ratios.

Pooling layer: The purpose of pooling layers is to reduce the size of the feature map. The filter usually has a size of 2 × 2. An N × N feature map is transformed into an N/2 × N/2 feature map by a pooling layer, which reduces the computational cost of training the CNN. There are two commonly used pooling filters, called max pooling and average pooling. The optimal pooling filter depends on the data and its features [35]. Consequently, a performance comparison test between max pooling and average pooling will be conducted in the following section.
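The effect of these two layer types on the feature map size can be checked with the standard output-size formulas. The sketch below applies one 5 × 5 convolution (no padding, stride 1) and one 2 × 2 pooling step to the 30 × 30 input; the helper names are assumptions, the pooling stride is assumed equal to the window size, and the full layer sequence of Fig. 2 is not reproduced.

```python
def conv_output_size(n, kernel=5, stride=1, padding=0):
    """Side length of the output of a square convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_output_size(n, window=2):
    """Side length after non-overlapping pooling (stride equal to the window)."""
    return n // window


size = 30                        # side length of the input feature map
size = conv_output_size(size)    # 30 - 5 + 1 = 26
size = pool_output_size(size)    # 26 // 2 = 13
print(size)                      # prints 13
```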

Fully connected layer: After the operation of the convolutional layers and pooling layers, the redundant and irrelevant features have been discarded. The fully connected layer has the role of classifying samples into several categories based on the condensed features. In practice, the fully connected layers can be treated as a multilayer perceptron (MLP). The number of neurons in the first layer of the MLP is determined by the length of the input vector, and that in the last layer is determined by the classification problem. The activation function is the Softmax function for the last layer and the hyperbolic tangent function for the other layers.

The cross-entropy loss function is used to quantify the difference between the outputs of the CNN and the targets, and the CNN is trained by using the Adam algorithm. The corresponding program is developed in PyTorch.
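As a small numerical illustration of the Softmax output and the cross-entropy loss mentioned above, the following standard-library sketch evaluates both for one sample. The logit values are hypothetical. In PyTorch itself, `torch.nn.CrossEntropyLoss` takes the raw outputs of the last layer and applies the log-Softmax internally, so an implementation would typically not add a separate Softmax before that loss.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_index):
    """Cross-entropy loss of one sample whose true class is target_index."""
    return -math.log(probabilities[target_index])


# Hypothetical raw outputs for the five classes PD, L-M-O, H-O, L-D, H-D.
logits = [2.0, 0.5, 0.1, -1.0, 0.3]
probs = softmax(logits)
loss = cross_entropy(probs, target_index=0)  # true class assumed to be PD
```

The loss approaches zero as the probability assigned to the true class approaches one, and equals ln 5 when all five classes are predicted with equal probability.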

4. Numerical tests and discussions

A set of numerical tests is performed in order to clarify the advantages and disadvantages of the proposed methodology. The samples used to train and test the CNN are collected from a power supply company in China and from the public literature [26–31], and the sample size distribution is shown in Table 1. The effect of the CNN architecture on the performance of the diagnosis model is studied to define the optimal architecture; afterwards, the proposed methodology is applied to the public IEC TC 10 transformer fault database [36] to verify its generalization potential.

In the following numerical experiments, 10-fold cross validation is used to select the optimal CNN architecture. The data set shown in Table 1 is randomly divided into 10 subsets. The CNN is trained 10 times, each time using 9 subsets for training and 1 subset for testing. The loss and accuracy during the training process are then averaged. The learning rate of the training algorithm is 0.001. The training is stopped after 10 epochs.
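The 10-fold splitting described above can be sketched as an index generator. The function name `k_fold_indices` and the fixed random seed are assumptions; the sample count of 849 is the total of Table 1.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random division into subsets
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for i in range(k):
        test = folds[i]
        # The remaining k - 1 subsets form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(849, k=10):
    pass  # train the CNN on `train`, evaluate on `test`, then average the metrics
```

With 849 samples, nine subsets contain 85 samples and one contains 84, so every sample is used for testing exactly once.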

4.1. The effect of the architecture of the CNN on the classification performance

4.1.1. Average pooling vs. max pooling

There are two commonly used pooling filters, average pooling and max pooling. The optimal choice depends on the characteristic of the input feature maps. In other words, the choice of pooling filter is problem-dependent. In order to determine the best pooling filter for the specific application, a comparison test is made; the architecture of the CNN is shown in Fig. 2.

The comparison results are shown in Fig. 3. In order to clearly outline the variation of the training loss, it is calculated every 100 iterations instead of every epoch. Only one feature map is fed to the CNN in each iteration, so each epoch contains N_train iterations, where N_train is the sample size of the training set. It can be concluded that max pooling performs slightly better than average pooling. First, the final loss with max pooling is smaller than the one with average pooling, meaning that the difference between the outputs and the targets is smaller when max pooling is used. Second, both the training accuracy and the test accuracy are higher when using max pooling. However, it should be noticed that the fluctuation during the training process is more evident when max pooling is used, as can be seen from the variation in training accuracy.

Fig. 3.

The performance comparison between the max pooling and average pooling. (a) Loss. (b) Training accuracy and test accuracy.

In summary, average pooling is more stable than max pooling but yields lower accuracy. Max pooling has a higher probability of training a better CNN for the studied problem; consequently, the max pooling filter is used in the following tests.

4.1.2. Architecture of convolutional layer

The convolutional layer is the core of the CNN. In order to investigate the effect of the number of convolutional layers on the accuracy of the results, the number of convolutional layers is varied from 2 to 4.

From the comparison results shown in Fig. 4(a), it can be seen that the loss is nearly independent of the number of convolutional layers, and the structure with 3 convolutional layers achieves both the highest training accuracy and the highest test accuracy. It is also found from Fig. 4(b) that when a 2-layer structure is used, the difference between the training accuracy and the test accuracy is smaller than for the other structures. However, the final training accuracy and test accuracy of the 2-layer structure are lower than those of the 3-layer structure, meaning that the 2-layer structure underfits. The situation is the opposite for the 4-layer structure, where the difference between the training accuracy and the test accuracy is significant, which indicates overfitting.

Fig. 4.

The performance comparison of different convolutional layer architectures. (a) Loss. (b) Training accuracy and test accuracy.

In summary, the 3-layer structure for the convolutional layer is the optimal choice to protect the CNN from both underfitting and overfitting.

4.1.3. Architecture of fully connected layer

The function of the fully connected layer is to generate the mapping relationship between the features extracted by convolutional layers and pooling layers and the targets. In order to determine the optimal structure of the fully connected layer, a comparison test is made. The number of fully connected layers is varied from 1 to 3. The number of neurons is 5 (1-layer), 50-5 (2-layer), and 100-20-5 (3-layer) respectively. The comparison results displayed in Fig. 5 demonstrate that the loss, training accuracy, and test accuracy exhibit minimal differences between 1-layer, 2-layer, and 3-layer structures. Therefore, it can be inferred that the number of layers in the fully connected layer structure has an insignificant impact on the performance of the CNN.

Fig. 5.

The performance comparison of different fully connected layer architectures. (a) Loss. (b) Training accuracy and test accuracy.

In summary, considering the stability and the computational cost of the CNN, a single fully connected layer is chosen.

4.2. The generalization capabilities

In the above tests, the training set and the test set are collected from the power supply company and the public literature. In order to validate the generalization potential of the proposed methodology, the test set is replaced by IEC TC 10 transformer data set [36]. The sample size distribution is shown in Table 3.

Table 3
Sample size of each category in IEC TC 10 database

Category	The number of samples
PD	2
L-M-O	10
H-O	14
L-D	23
H-D	45
Total	94

The architecture of the CNN is based on the investigation made in the above subsection. Specifically, a 3-layer structure for the convolutional layers, connected by max pooling layers, and 1 fully connected layer are used. The test results of the proposed methodology are shown in Fig. 6.

Fig. 6.

The performance of the proposed methodology without using the improved SMOTE on public IEC TC 10 database. (a) Loss. (b) Training accuracy and test accuracy. (c) Confusion matrix of training set. (d) Confusion matrix of test set.

From Fig. 6(a) it is evident that the training process converges within 10 epochs and reaches a stable loss. The differences between the training accuracy and the test accuracy reduce as the training process proceeds. The final difference between the training accuracy and the test accuracy is below 3%, meaning that the proposed methodology achieves great generalization. Figure 6(c,d) show the confusion matrices of the training results and the test results. According to the definition of the confusion matrix, rows represent the real labels of samples and columns represent the labels predicted by the CNN. The diagnosis accuracy of each fault type can be calculated based on the confusion matrix, and the result is shown in Table 4.

In order to observe the influence of sample size on classification accuracy, the sample size of each fault type in the original training set is also shown in Table 4. It can be seen that fault types with small sample sizes, especially PD and L-M-O, achieve low diagnosis accuracy (both precision rate and recall rate). This phenomenon appears in both the training set and the test set. In other words, there is some bias in the trained CNN, which makes it focus mainly on the accurate classification of the classes with large sample sizes. In order to remove this bias, the improved SMOTE is applied to generate synthetic samples so that the sample size of each fault type is even.

The results obtained by combining the improved SMOTE with the CNN are shown in Fig. 7. The training process again converges within 10 epochs, and the difference between the training accuracy and the test accuracy is still below 3%. Consequently, the proposed model still generalizes well after the improved SMOTE is introduced. The effect of the improved SMOTE can be assessed by comparing the precision rate and the recall rate before and after it is applied (Table 4). The training accuracies (both the precision rate and the recall rate) of the minorities are improved, meaning that the improved SMOTE is effective in removing the bias of the CNN produced in the training process. However, it should also be noted that the training accuracy of the whole training set may slightly degrade.

For a classification problem, apart from the accuracy, two other important indices to evaluate the performance of a classifier are the precision rate and the recall rate. In a multi-class problem, these two indices can be calculated as follows: $\begin{array}{l}\text{Precision rate: } P(i)=\dfrac{\mathit{TP}(i)}{N_{P}(i)}\\[6pt] \text{Recall rate: } R(i)=\dfrac{\mathit{TP}(i)}{N_{R}(i)}\end{array}$ (3) where $P(i)$ and $R(i)$ represent the precision rate and recall rate of the ith class, respectively. $\mathit{TP}(i)$ represents the number of correctly classified samples belonging to the ith class, while $N_{P}(i)$ and $N_{R}(i)$ represent the number of samples belonging to the ith class in the classifier’s classification results and in the test data set, respectively.
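Equation (3) can be evaluated directly from a confusion matrix whose rows are the real labels and whose columns are the predicted labels, as defined earlier. The sketch below is illustrative only; the function name `precision_recall` is an assumption and the 3-class matrix is hypothetical, not taken from Fig. 6 or Fig. 7.

```python
def precision_recall(confusion):
    """Return a list of (P(i), R(i)) pairs, one per class.
    confusion[r][c] = number of samples of real class r predicted as class c."""
    n = len(confusion)
    results = []
    for i in range(n):
        tp = confusion[i][i]                           # TP(i)
        n_r = sum(confusion[i])                        # N_R(i): row sum
        n_p = sum(confusion[r][i] for r in range(n))   # N_P(i): column sum
        precision = tp / n_p if n_p else 0.0
        recall = tp / n_r if n_r else 0.0
        results.append((precision, recall))
    return results


# Hypothetical 3-class confusion matrix (rows: real, columns: predicted).
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
for i, (p, r) in enumerate(precision_recall(confusion)):
    print(f"class {i}: P = {p:.3f}, R = {r:.3f}")
```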

Table 4

The precision rate and recall rate distribution

		CNN without using SMOTE				CNN with using SMOTE
		Training set		Test set		Training set		Test set
Class	Sample size	P	R	P	R	P	R	P	R
PD	26	72.7%	61.5%	66.7%	100.0%	81.1%	88.7%	66.7%	100.0%
L-M-O	109	83.3%	78.0%	77.8%	70.0%	88.6%	85.2%	69.2%	90.0%
L-D	120	80.8%	84.2%	95.2%	87.0%	82.2%	83.6%	95.0%	82.6%
H-O	391	95.4%	95.9%	80.0%	85.7%	93.8%	89.2%	90.9%	71.4%
H-D	203	89.4%	91.1%	95.7%	97.8%	89.5%	87.2%	93.6%	97.8%
Accuracy		89.8%		90.4%		86.8%		89.4%

P: precision rate; R: recall rate.

Fig. 7.

The performance of the proposed methodology with the improved SMOTE on the public IEC TC 10 database. (a) Loss. (b) Training accuracy and test accuracy. (c) Confusion matrix of training set. (d) Confusion matrix of test set.

The test accuracies with and without the improved SMOTE are close (89.4% and 90.4%, respectively; see Table 4). However, the distributions of the precision rate and the recall rate differ. Usually, for a specific class, only one of the two rates can be improved, and the other one degrades slightly. The effect of the SMOTE on the CNN should be further studied, especially the problem of how to balance the precision rate and the recall rate after the SMOTE is used.
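The trade-off can be read directly from Table 4. The sketch below copies the test-set precision and recall rates from Table 4 and lists their change after the improved SMOTE is introduced; the variable and function names are illustrative only.

```python
# Test-set precision (P) and recall (R) in percent, copied from Table 4.
without_smote = {
    "PD": (66.7, 100.0),
    "L-M-O": (77.8, 70.0),
    "L-D": (95.2, 87.0),
    "H-O": (80.0, 85.7),
    "H-D": (95.7, 97.8),
}
with_smote = {
    "PD": (66.7, 100.0),
    "L-M-O": (69.2, 90.0),
    "L-D": (95.0, 82.6),
    "H-O": (90.9, 71.4),
    "H-D": (93.6, 97.8),
}


def deltas(before, after):
    """Change in (P, R) per class, in percentage points."""
    return {
        name: (round(after[name][0] - before[name][0], 1),
               round(after[name][1] - before[name][1], 1))
        for name in before
    }


changes = deltas(without_smote, with_smote)
for name, (dp, dr) in changes.items():
    print(f"{name:6s} dP = {dp:+.1f}  dR = {dr:+.1f}")
```

For L-M-O the recall rate rises while the precision rate falls, and for H-O the opposite happens, which matches the behaviour described above.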

Table 5

The comparisons of the diagnosis results of IEC TC 10 database obtained by different methods

Fault type	Sample size	This paper	[16]	[18]	[19]
PD	2	2	0	2	2
LD	23	19	14	13	18
HD	45	44	44	44	43
LMO	10	9	7	7	7
HO	14	10	13	13	12
Average accuracy	/	89.4%	83.0%	84.0%	87.2%

Transformer faults can be roughly categorized into two types: overheating faults and discharge faults. The proposed methodology achieves 100% accuracy in distinguishing between these two types, as shown in Figs. 6(d) and 7(d). However, prediction errors may occur when evaluating the severity of the faults. For instance, some samples of high temperature overheating may be mislabeled as low-middle temperature overheating, because there is no explicit boundary between these fault types. Similarly, there may be confusion among high-energy discharge, low-energy discharge, and partial discharge. Nevertheless, the methodology achieves approximately 90% accuracy in evaluating the severity of the faults and 100% accuracy in recognizing the type of fault, which is satisfactory.

A comparison is conducted between the proposed methodology and previous studies. The main advantage of using a CNN is its ability to select effective features automatically. Therefore, the compared studies are previous intelligent transformer fault diagnosis methods that also employ feature selection. All the vital characteristics of the methods proposed in [16,18,19], including the classifier used (SVM) and the adopted optimal feature combination, are retained when conducting the comparison. The compared models are trained using the same data set, and the final test results are shown in Table 5. The proposed methodology outperforms the previous studies, achieving a diagnosis accuracy of 89.4%, while the methods proposed in [16,18,19] achieve 83.0%, 84.0%, and 87.2%, respectively. The comparison results confirm the accuracy of the proposed methodology.
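The accuracies in Table 5 can be checked with a short calculation. The sketch below assumes that each method column of Table 5 lists the number of correctly diagnosed samples per fault type and that the reported accuracy is the total number of correct diagnoses divided by the 94 test samples; the names are illustrative.

```python
# Counts copied from Table 5 (IEC TC 10 database).
sample_size = {"PD": 2, "LD": 23, "HD": 45, "LMO": 10, "HO": 14}
correct = {
    "This paper": {"PD": 2, "LD": 19, "HD": 44, "LMO": 9, "HO": 10},
    "[16]": {"PD": 0, "LD": 14, "HD": 44, "LMO": 7, "HO": 13},
    "[18]": {"PD": 2, "LD": 13, "HD": 44, "LMO": 7, "HO": 13},
    "[19]": {"PD": 2, "LD": 18, "HD": 43, "LMO": 7, "HO": 12},
}


def overall_accuracy(correct_counts, sizes):
    """Correctly diagnosed samples divided by all test samples."""
    return sum(correct_counts.values()) / sum(sizes.values())


def per_class_recall(correct_counts, sizes):
    """Recall rate per fault type: correct diagnoses over class size."""
    return {name: correct_counts[name] / sizes[name] for name in sizes}


accuracy = {m: overall_accuracy(c, sample_size) for m, c in correct.items()}
recall = {m: per_class_recall(c, sample_size) for m, c in correct.items()}

for method, value in accuracy.items():
    print(f"{method}: {100 * value:.1f}%")
```

Under these assumptions the computed values agree with the accuracy row of Table 5, and the per-class ratios for the proposed methodology agree with the test-set recall rates reported in Table 4 for the CNN with SMOTE.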

5. Conclusion

A methodology to diagnose transformer faults based on DGA by using a CNN is proposed. Considering the imbalance of the training data set, the improved SMOTE is combined with the data cleaning technique to generate synthetic samples so that the sample size of each fault type becomes even.

The main outcomes of this study are given below:

The architecture of the CNN has an influence on the classification performance.

For the CNN used in the study, max pooling has a higher potential than average pooling to achieve better classification accuracy.

Choosing an appropriate number of convolutional layers can protect the CNN from both underfitting and overfitting. For example, a 3-layer structure is an optimal choice for the proposed model.

The number of fully connected layers has an insignificant effect on the performance of the CNN. The CNN may benefit from choosing a simple structure for the fully connected layers.

The proposed methodology generalizes well, with the difference between the training accuracy and the test accuracy below 3%.

The introduction of the improved SMOTE can effectively remove the bias of the CNN caused by the imbalanced training data set. However, the strategy to achieve the balance between the precision rate and the recall rate should be further studied.

References

1. Duval, A review of faults detectable by gas-in-oil analysis in transformers, IEEE Electr. Insul. Mag. 18 (2002), 8–17.

2. International Electrotechnical Commission, Mineral oil-filled electrical equipment in service-Guidance on the interpretation of dissolved and free gases analysis, IEC 60599, 2022.

3. Gouda O.E., El-Hoshy S.H. and L-Tamaly H.H.E., Proposed three ratios technique for the interpretation of mineral oil transformers based dissolved gas analysis, IET Gener. Transm. Distrib. 12 (2018), 2650–2661.

4. Lopes S.M.D., Flauzino R.A. and Altafim R.A.C., Incipient fault diagnosis in power transformers by data-driven models with over-sampled dataset, Electr. Power Syst. Res. 201 (2021), 107519.

5. Chatterjee, Dawn, Jadoun V.K. and Jarial R.K., Novel prediction-reliability based graphical DGA technique using multi-layer perceptron network & gas ratio combination algorithm, IET Sci. Meas. Technol. 13 (2019), 836–842.

6. Zhang H.R., Sun J.X., Hou K.N., Q.Q. and Liu H.S., Improved information entropy weighted vague support vector machine method for transformer fault diagnosis, High Volt. 7 (2022), 510–522.

7. Rao U.M., Fofana, Rajesh K.N.V.P.S. and Picher, Identification and application of machine learning algorithms for transformer dissolved gas analysis, IEEE Trans. Dielectr. Electr. Insul. 28 (2021), 1828–1835.

8. Paul, Goswami A.K., Chetri R.L., Roy and Sen, Bayesian optimization-based gradient boosting method of fault detection in oil-immersed transformer and reactors, IEEE Trans. Ind. Appl. 58 (2022), 1910–1919.

9. Fan J.M., Condition forecasting of a power transformer based on an online monitor with EL-CSO-ANN, Energies 15 (2022), 8587.

10. Fernandez, Garcia, Herrera and Chawla N.V., SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary, J. Artif. Intell. Res. 61 (2018), 863–905.

11. Abbasi A.R., Fault detection and diagnosis in power transformers: A comprehensive review and classification of publications and methods, Electr. Power Syst. Res. 209 (2022), 107990.

12. Wani S.A., Rana A.S., Sohail, Rahman, Parveen and Khan S.A., Advances in DGA based condition monitoring of transformers: A review, Renew. Sust. Energ. Rev. 149 (2021), 111347.

13. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002), 321–357.

14. Chen, Xia, Chen, Wang and Wang, RSMOTE: A self-adaptive robust SMOTE for imbalanced problems with label noise, Inf. Sci. 553 (2021), 397–428.

15. Hsu H.H., Hsieh C.W. and Lu M.D., Hybrid feature selection by combining filters and wrappers, Expert Syst. Appl. 38 (2011), 8144–8150.

16. J.Z., Zhang Q.G., Wang J.Y., Zhou T.C. and Zhang Y.Y., Optimal dissolved gas ratios selected by genetic algorithm for power transformer fault diagnosis based on support vector machine, IEEE Trans. Dielectr. Electr. Insul. 23 (2016), 1198–1206.

17. Peimankar, Weddell S.J., Jalal and Lapthorn A.C., Evolutionary multi-objective fault diagnosis of power transformers, Swarm Evol. Comput. 36 (2017), 62–75.

18. Fang, Zheng, Liu, Zhao, Zhang and Wang K.J.E., A transformer fault diagnosis model using an optimal hybrid dissolved gas analysis features subset with improved social group optimization-support vector machine classifier, Energies 11 (2018), 1922.

19. Kari, Gao W.S., Zhao D.B., Abiderexiti, Wang, Hybrid feature selection approach for power transformer fault diagnosis based on support vector machine and genetic algorithm, IET Gener. Transm. Distrib. 12 (2018), 5672–5680.

20. Chen, Jiang, Jia and Ghamisi, Deep feature extraction and classification of hyperspectral images based on convolutional neural networks, IEEE Trans. Geosci. Remote Sens. 54 (2016), 6232–6251.

21. Lin, Zhu J.G. and Cui, A transfer ensemble learning method for evaluating power transformer health conditions with limited measurement data, IEEE Trans. Instrum. Meas. 71 (2022), 1–10.

22. Lin and Zhu J.G., Hierarchical federated learning for power transformer fault diagnosis, IEEE Trans. Instrum. Meas. 71 (2022), 1–11.

23. Z.H., Y.G., Xing Z.K. and Duan J.J., Transformer fault diagnosis based on improved deep coupled dense convolutional neural network, Electr. Power Syst. Res. 209 (2022), 107969.

24. Taha I.B.M., Ibrahim and Mansour D.E.A., Power transformer fault diagnosis based on DGA using a convolutional neural network with noise in measurements, IEEE Access 9 (2021), 111162–111170.

25. Liao W.L., Yang D.C., Wang Y.S. and Ren, Fault diagnosis of power transformers using graph convolutional network, CSEE J. Power Energy Syst. 7 (2021), 241–249.

26. Zhang, Ding, Liu and Griffin P.J., An artificial neural network approach to transformer fault diagnosis, IEEE Trans. Power Del. 11 (1996), 1836–1841.

27. Vanegas, Mizuno, Naito and Kamiya, Diagnosis of oil-insulated power apparatus by using neural network simulation, IEEE Trans. Dielectr. Electr. Insul. 4 (1997), 290–299.

28. Gao, Zhang G.J., Qian, Yan and Zhu D.H., Diagnosis of DGA based on fuzzy and ANN methods, in: Proceedings of International Symposium on Electrical Insulating Materials, 1998, pp. 767–770.

29. Siva Sarma S.S. and Kalyani G.N.S., ANN approach for condition monitoring of power transformers using DGA, in: IEEE Region 10 Conference TENCON, 2004, pp. 444–447.

30. Ganyun L.V., Cheng H.Z., Zhai H.B. and Dong L.X., Fault diagnosis of power transformer based on multi-layer SVM classifier, Electr. Power Syst. Res. 74 (2005), 1–7.

31. Dissolved gas data in transformer oil-fault diagnosis of power transformers with membership degree, IEEE Dataport, 2019.

32. Sáez J.A., Galar, Luengo and Herrera, Tackling the problem of classification with noisy data using multiple classifier systems: analysis of the performance and robustness, Inf. Sci. 247 (2013), 1–20.

33. Rao, Zou, Yang and Khan S.A., Fault diagnosis of power transformers using ANN and SMOTE algorithm, Int. J. Appl. Electromagn. Mech. 70 (2022), 345–355.

34. Jiao J.Y., Zhao, Lin and Liang K.X., A comprehensive review on convolutional neural network in machine fault diagnosis, Neurocomputing 417 (2020), 36–63.

35. LeCun, Kavukcuoglu and Farabet, Convolutional networks and applications in vision, in: Proceedings of 2010 IEEE International Symposium on Circuits and Systems, 2010, pp. 253–256.

36. Duval and dePablo, Interpretation of gas-in-oil analysis using new IEC publication 60599 and IEC TC 10 databases, IEEE Electr. Insul. Mag. 17 (2001), 31–41.

	Numerator
Denominator	$C_{5}^{1}$	$C_{5}^{2}$	$C_{5}^{3}$	$C_{5}^{4}$
$C_{5}^{1}$	$C_{5}^{1}/C_{5}^{1}$	$C_{5}^{2}/C_{5}^{1}$	$C_{5}^{3}/C_{5}^{1}$	$C_{5}^{4}/C_{5}^{1}$
$C_{5}^{2}$	$C_{5}^{1}/C_{5}^{2}$	$C_{5}^{2}/C_{5}^{2}$	$C_{5}^{3}/C_{5}^{2}$	$C_{5}^{4}/C_{5}^{2}$
$C_{5}^{3}$	$C_{5}^{1}/C_{5}^{3}$	$C_{5}^{2}/C_{5}^{3}$	$C_{5}^{3}/C_{5}^{3}$	$C_{5}^{4}/C_{5}^{3}$
$C_{5}^{4}$	$C_{5}^{1}/C_{5}^{4}$	$C_{5}^{2}/C_{5}^{4}$	$C_{5}^{3}/C_{5}^{4}$	$C_{5}^{4}/C_{5}^{4}$

Table 1

The sample size of each category in the used data set

Category	The number of samples
PD	26
L-M-O	109
H-O	391
L-D	120
H-D	203
Total	849

Table 3

Sample size of each category in IEC TC 10 database

Category	The number of samples
PD	2
L-M-O	10
H-O	14
L-D	23
H-D	45
Total	94