Abstract
In this contribution a methodology to diagnose transformer faults based on Dissolved Gas Analysis (DGA) by using a convolutional neural network (CNN) is proposed. The algorithm to transform the gas contents (resulting from the DGA analysis) into feature maps is introduced, and the resulting feature maps are the input of the CNN. In order to take into account the fact that the data set is imbalanced, the improved Synthetic Minority Over-Sampling Technique (SMOTE) is combined with the data cleaning technique to protect the CNN from training bias. The effect of the CNN architecture on the classification performance is also investigated to determine the optimal CNN parameters. All the above mentioned possibilities are tested and their performance investigated; in addition, a final test on the IEC TC 10 transformer fault database validates the accuracy and the generalization potential of the proposed methodology.
Introduction
Power transformers are a key element of the electric energy generation and transmission system, and an outage of a power transformer may have heavy consequences from many points of view; for this reason, condition monitoring and predictive maintenance are fundamental. In addition, the detection of incipient faults inside transformers is a key aspect of the economic optimization of the maintenance plan: incipient faults may appear due to electrical, thermal, or mechanical damage, and a transformer operating in an abnormal state may deteriorate further, turning incipient faults into catastrophic failures. Consequently, the detection and elimination of incipient faults inside the transformer is of great importance. However, incipient faults inside transformers are hard for maintenance personnel to detect by visual inspection. Thanks to the development of online monitoring technology, predictive diagnostics can be performed: Dissolved Gas Analysis (DGA) is the most commonly used online monitoring technology owing to its convenience and economy.
DGA measures the content of several gases, such as H2, CH4, C2H6, C2H4, and C2H2, in oil-insulated transformers. The mapping between fault types and the detected gases is not easy to establish, since these fault gases can also be generated during the regular operation of transformers. Traditional DGA methods, such as the Duval triangle method [1] and the IEC three-ratio method [2], adopt ratios between fault gases and distinguish different fault types by setting thresholds. However, investigations by other researchers have confirmed that the diagnosis accuracy of these traditional methods sometimes cannot meet the requirements [3].
With the rapid development of machine learning, an increasing number of researchers have devoted efforts to diagnosing transformer faults with machine learning based techniques, such as the multilayer perceptron (MLP) [4,5], the support vector machine (SVM) [6,7], and ensemble learning [8,9]. Unlike traditional transformer diagnosis methods, which are based on predefined rules, methods based on machine learning rely on training the model with fault samples; this approach, however, is not free of problems. The two most common problems are the imbalance of the data set and the feature selection. The sample sizes of different categories are usually uneven, and an unbalanced training data set may cause a bias in the learning model, leading to high accuracy for majorities (categories having more samples than the average) but low accuracy for minorities (categories having fewer samples than the average) [10]. As for feature selection, it has been confirmed that using gas ratios as model inputs is more efficient than using the original gas contents [11,12]. However, it is still unclear what the optimal gas ratio combination is.
Some methods have been proposed by researchers to solve the unbalanced classification problems. The simplest method to address the imbalance in the data set is oversampling the minorities and/or undersampling the majorities. However, the oversampling of the minorities may cause overfitting of the classifier, and the undersampling of the majorities may lead to the loss of some important information hidden in the samples of the majorities. To overcome the defects of both oversampling and undersampling, a new method called Synthetic Minority Over-sampling Technique (SMOTE) was developed and has been proven to be efficient in dealing with unbalanced data sets [13]. However, it is also confirmed that the SMOTE may increase the overlapping between the minorities and the majorities, especially when label noise in the data set exists [14]. Consequently, in this paper, the SMOTE is combined with a data cleaning technique to solve the imbalance issue in the transformer fault data set. In addition, in order to decrease the probability of the clustering of the synthetic samples, the improved SMOTE is employed, which is characterised by space interpolation instead of the original linear interpolation.
There are two strategies for feature selection: filter algorithms and wrapper algorithms [15] and they have both been applied to transformer fault diagnosis. The wrapper algorithms are adopted to select optimal features in [16–18] while a hybrid feature selection approach of combining filter algorithms and wrapper algorithms is proposed in [19]. However, both filter and wrapper algorithms have their respective deficiencies; the mutual effects among different features are neglected in filter algorithms, leading to a sub-optimal feature combination. The wrapper algorithm is difficult to apply to large-size feature set since the computational expense scales exponentially with the size of the feature set.
An alternative solution to feature selection is deep learning, especially the convolutional neural network (CNN), whose most outstanding advantage is automatic feature extraction [20]. The convolutional layers and pooling layers in a CNN can automatically compress the feature dimension by discarding irrelevant and redundant features, so that only effective features are sent to the fully connected layers. Some research related to the application of CNNs in transformer fault diagnosis has been done [21–25]; in particular, a CNN is combined with transfer learning to overcome the difficulty of collecting high-quality transformer fault samples [21]. Similarly, federated learning is introduced in order to train a CNN with a small training data set [22]. A transformer diagnosis method based on a combination of an improved CNN and adaptive synthetic oversampling (ADASYN) is proposed in [23], while the effects of different noise levels are analysed in [24]. A thorough comparison between the CNN and other common classifiers is carried out in [25], showing the effectiveness of CNN based approaches.
Based on the aforementioned techniques, a novel methodology to diagnose transformer faults from the unbalanced DGA data set is proposed in this paper. First, the improved SMOTE combined with the data cleaning technique is used to solve the imbalance of the DGA data set avoiding the bias of the classifier. Second, a CNN is trained by the rebalanced transformer fault data set. The numerical results show good accuracy of the proposed methodology.
The main contributions of this paper are summarized as follows:
(1) A strategy to solve the imbalance of the data set by combining the improved SMOTE and the data cleaning technique.
(2) A novel framework to generate feature maps for the CNN from the original DGA data.
(3) A methodology for condition monitoring of power transformers characterized by high accuracy, based on the above two improvements.
The remainder of this paper is organized as follows: Section 2 describes the procedure of using the improved SMOTE and the data cleaning technique to solve the imbalance of the training data set. Section 3 explains the steps of constructing a CNN-based transformer fault diagnosis model. Section 4 presents the application of the proposed method, and Section 5 concludes the paper.
Data pre-processing for unbalanced data set
The imbalance of the data set is very common in classification problems. Table 1 shows one of the two data sets used in this contribution; it has been obtained from data collected from a power supply company in China and from the available literature [26–31]. The table shows the number of available samples for each fault category (a sample is composed of 5 features, each being the concentration of one of the 5 gases mentioned before): Partial Discharge (PD), Low–Middle temperature Overheating (L-M-O), High temperature Overheating (H-O), Low energy Discharge (L-D) and High energy Discharge (H-D). It is clear that the data set is unbalanced, which biases the classifier towards assigning an unknown sample to one of the majority categories; the classification accuracy of the minorities will therefore probably be much lower than the average classification accuracy. Consequently, data pre-processing is necessary before using the data set to train the classifier.
The data pre-processing in this paper includes data cleaning and the SMOTE, the latter used for the rebalance of the data set. However, it is known that the SMOTE may increase the overlapping between the majorities and the minorities, especially when label noise exists [14]. For this reason a prior data cleaning technique is adopted to filter the label noise so that the synthetic samples generated by SMOTE will not be generated around noisy samples.
The sample size of each category in the used data set
PD: partial discharge; L-M-O: low-middle temperature overheating; H-O: high temperature overheating; L-D: low energy discharge; H-D: high energy discharge.
The data cleaning procedure is implemented in order to filter the abnormal samples obtained by the feature combination. The rationale behind a data cleaning procedure lies in the fact that samples belonging to the same category (hence having the same label) are usually clustered together in the feature space. If a sample is far from its corresponding cluster, the probability that this sample is abnormal is relatively large. In classification problems, there are two common sources of abnormal samples: wrongly labeled samples, and measurement errors (leading to one or more wrongly measured features).
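It is assumed that samples with the same label follow a Gaussian distribution in each dimension of the feature space. In this respect, the abnormal samples can be discriminated by the 3σ rule. According to the probability density function of the Gaussian distribution, the following probability holds:
\[ P(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 99.73\% \]
where μ and σ are the mean and the standard deviation of the considered feature within the category; a value outside this interval is therefore treated as abnormal.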
For a feature space with more than one dimension, as in this application, the aforementioned 3σ rule is applied dimension by dimension: if one feature does not satisfy the rule, the corresponding sample is considered to be abnormal. Following this data cleaning technique, wrongly labelled samples are (with high probability) purged, while samples affected by measurement uncertainties are often retained. This is not a problem, since it has been demonstrated in [32] that multiple classifier systems perform better with noisy data, due to the increased generalization capability obtained when the classifier is built.
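The dimension-by-dimension 3σ filter can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `three_sigma_clean`, the use of the population standard deviation, and the toy data are assumptions made here for the example.

```python
from statistics import mean, pstdev

def three_sigma_clean(samples):
    """Keep the samples of ONE category whose every feature lies within mean ± 3*std."""
    n_features = len(samples[0])
    # Per-feature mean and (population) standard deviation over this category only.
    mus = [mean(s[d] for s in samples) for d in range(n_features)]
    sigmas = [pstdev([s[d] for s in samples]) for d in range(n_features)]
    kept = []
    for s in samples:
        # A single feature outside the 3-sigma band marks the whole sample as abnormal.
        if all(abs(s[d] - mus[d]) <= 3 * sigmas[d] for d in range(n_features)):
            kept.append(s)
    return kept

# Hypothetical two-feature category: 20 regular samples and one gross outlier.
samples = [[10.0 + (i % 3), 5.0] for i in range(20)] + [[1000.0, 5.0]]
cleaned = three_sigma_clean(samples)
print(len(samples), len(cleaned))  # 21 20
```

The filter is applied to each fault category separately, since the Gaussian assumption is made per label. Note that with very few samples per category a single outlier cannot exceed the 3σ band, so the rule only becomes effective for categories with more than about ten samples.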
At the end of the cleaning procedure performed on the data set described in Table 1, the number of samples is reduced to 23 (PD), 105 (L-M-O), 382 (H-O), 113 (L-D), 194 (H-D).
If the training data are unbalanced, the trained classifier will be subject to bias and the classification accuracy of the majorities will be higher than that of the minorities. Taking an abnormal product identification problem as an example, if only about 2% of the products are abnormal and classification accuracy is used as the evaluation criterion, the classifier tends to classify these products as normal in order to achieve a higher classification accuracy. Consequently, even if the classifier classifies all samples as normal, the classification accuracy is still as high as 98%, but such a classifier cannot identify any abnormal sample.
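The arithmetic of this example can be checked directly. The snippet below is only an illustration with made-up counts (1000 products, 20 of them abnormal) and a degenerate classifier that always answers "normal".

```python
# 1000 hypothetical products, 2% abnormal (label 1), the rest normal (label 0).
labels = [1] * 20 + [0] * 980
# Degenerate classifier: every product is predicted as normal.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
# Recall of the abnormal class: correctly detected abnormal / all abnormal.
detected = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall_abnormal = detected / labels.count(1)

print(accuracy)         # 0.98
print(recall_abnormal)  # 0.0
```

The overall accuracy hides the fact that the recall of the minority class is zero, which is why per-class indices are needed when the data set is unbalanced.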
In the field of imbalanced classification, the SMOTE has been proven to be efficient in solving the bias of classifiers [10]. The SMOTE is similar to the oversampling of minorities; however, plain oversampling balances the training data by simply duplicating the available samples, while the SMOTE operates by generating synthetic samples around the existing ones. It has been demonstrated that oversampling of the minorities easily leads to overfitting of the classifiers [13], while the adoption of the SMOTE yields results less prone to overfitting.
A modified version of the technique, called improved SMOTE, is proposed in [13]; in this improved version, a new interpolation method, called space interpolation, is used. The details of the improved SMOTE are described in [33], where it is shown that the synthetic samples generated by the improved SMOTE have a higher degree of dispersion, while the ones generated by the original SMOTE are arranged approximately along lines. In summary, the synthetic samples generated by the improved SMOTE are evenly distributed in the feature space around the cluster center of each category, reducing the overlapping of synthetic samples, which is beneficial to the generalization of classifiers trained with this data set.
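To see why the original SMOTE places synthetic samples along lines, it helps to look at its interpolation step. The sketch below implements only the original linear interpolation between a minority sample and one of its nearest minority neighbours; the space interpolation of the improved SMOTE is not specified in this text and is not reproduced here. The function name `smote_linear` and the parameters `k` and `seed` are assumptions for illustration.

```python
import random

def smote_linear(minority, n_synthetic, k=3, seed=0):
    """Original SMOTE step: new sample = x + lam * (neighbour - x), lam in [0, 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x inside the minority class (x itself excluded).
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical minority class with four two-dimensional samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_linear(minority, n_synthetic=10)
print(len(new_samples))  # 10
```

Every synthetic sample lies on a segment joining two existing samples, which explains the line-like arrangement mentioned above.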
Formulation of the fault diagnosis model
A CNN is a special kind of classifier used to classify images and perform object recognition [20,34]. One of the most valuable advantages of the CNN is its automatic feature extraction ability. Usually, there are many irrelevant features in the given images, but a CNN can filter out irrelevant features and compress the feature dimension. As explained before, each sample is composed of five gas contents; in order to transform this data set into a form that can be used by a CNN, the authors developed a procedure to obtain 2D grayscale images from the raw data. This section is organized as follows: first, the framework to transform DGA data into graphs is introduced; second, the structure of the CNN used in this study is illustrated.
Transformation from DGA to feature maps
The original features are the contents of five fault gases: H2, CH4, C2H6, C2H4, and C2H2, so each sample of Table 1 is composed of 5 features. In order to obtain a data set suitable for a CNN, the authors define a set of new quantities, each obtained by adding up the contents of i of the five gases, with i = 1–4 (5 + 10 + 10 + 5 = 30 combinations in total); this defines the coefficients
A feature map generated by Algorithm 1 is a grayscale graph that contains information about all possible gas ratios obtained from the contents of five fault gases, including H2, CH4, C2H6, C2H4, and C2H2. To generate a feature map, the gas content vector X, which has a length of 30, should be prepared first. This is done by adding up different combinations of the five fault gases, and then considering all possible ratios between them. The grayscale graph is then generated by scaling the matrix of gas ratios into the range of [0, 255], where the maximum and minimum values of the matrix are determined, and the matrix is normalized accordingly. The resulting matrix is then used to draw the grayscale graph. The feature map, which is a representation of the normalized grayscale graph, is then used as the input for the CNN. In summary, the feature map is a visual representation of the gas ratios that can be used as input to the CNN for transformer fault diagnosis.
All possible gas ratios
From Algorithm 1, it is seen that the obtained grayscale graph is in 30 × 30 pixels, and some examples are shown in Fig. 1. The obtained grayscale images, also called feature maps, are the input of the CNN.
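Since Algorithm 1 itself is not reproduced in this text, the following is a sketch of how such a 30 × 30 feature map could be built from the description above. The ordering of the 30 summed quantities, the orientation of the ratio matrix (row quantity divided by column quantity), the small constant `eps` guarding against division by zero, and the rounding to integer grey levels are all assumptions made for this illustration.

```python
from itertools import combinations

def feature_map(gases, eps=1e-9):
    """gases: contents of H2, CH4, C2H6, C2H4, C2H2 (five non-negative numbers)."""
    # Sums over every combination of 1 to 4 gases: 5 + 10 + 10 + 5 = 30 entries.
    x = [sum(c) for i in range(1, 5) for c in combinations(gases, i)]
    # 30 x 30 matrix of all pairwise ratios x_i / x_j; eps avoids division by zero.
    ratios = [[xi / (xj + eps) for xj in x] for xi in x]
    lo = min(min(row) for row in ratios)
    hi = max(max(row) for row in ratios)
    span = (hi - lo) or 1.0  # guards against a constant matrix
    # Min-max scaling of the ratio matrix into grey levels 0..255.
    return [[round(255 * (r - lo) / span) for r in row] for row in ratios]

# Hypothetical gas contents for one sample.
fmap = feature_map([120.0, 35.0, 12.0, 60.0, 3.0])
print(len(fmap), len(fmap[0]))  # 30 30
```

Because the scaling uses the minimum and maximum of each sample's own matrix, every feature map spans the full grey range regardless of the absolute gas concentrations.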

Some representative examples of grayscale graphs. (a) PD. (b) L-M-O. (c) H-O. (d) L-D. (e) H-D.

The architecture of the CNN used to build the diagnosis model.
Usually, a CNN consists of convolutional layers, pooling layers and fully connected layers. The architecture of the CNN used in the proposed methodology is shown in Fig. 2. The 2D feature maps will be condensed into a 1D feature vector after the operation of convolutional layers and max pooling layers, and the feature vector will be the input to fully connected layers to obtain the final classification results.
The output of a convolutional layer is a condensed feature map, obtained by a convolution operation using convolutional filters. Simply speaking, each output pixel can be viewed as the weighted sum of the pixels in a specific area of the original feature map. Consequently, in terms of physical meaning, the features obtained by convolving the input data can be thought of as weighted sums of fault gas ratios.
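The weighted-sum interpretation can be made concrete with a tiny pure-Python example. It uses the sliding-window form without kernel flipping (cross-correlation), which is what deep learning frameworks usually call convolution; the 3 × 3 input and the 2 × 2 kernel are arbitrary values chosen for illustration.

```python
def conv2d_valid(image, kernel):
    """Each output pixel is the weighted sum of the input patch under the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[r + a][c + b] for a in range(kh) for b in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d_valid(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

In a trained CNN the kernel weights are learned, so each output pixel becomes a learned weighted sum of neighbouring gas ratios in the feature map.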
The cross entropy loss function is used to quantify the difference between the outputs of the CNN and the targets, and the CNN is trained using the Adam algorithm. The corresponding program is developed in PyTorch.
A set of numerical tests is performed in order to clarify the advantages and disadvantages of the proposed methodology. The samples used to train and test the CNN are collected from a power supply company in China and from public literature [26–31], and the sample size distribution is shown in Table 1. The effect of the CNN architecture on the performance of the diagnosis model is studied to define its optimal architecture; afterwards, the proposed methodology is applied to the public IEC TC 10 transformer fault database [36] in order to verify its generalization potential.
In the following numerical experiments, 10-fold cross validation is used to select the optimal CNN architecture. The data set shown in Table 1 is randomly divided into 10 subsets. The CNN is trained 10 times, each time using 9 subsets for training and 1 subset for testing. The loss and accuracy during the training process are then averaged. The learning rate of the training algorithm is 0.001. The training is stopped after 10 epochs.
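The 10-fold splitting described above can be sketched as follows. This is a minimal pure-Python illustration; the generator name `k_fold_indices`, the fixed `seed`, and the sample count of 103 are hypothetical, and the actual training and evaluation of the CNN on each split is left out.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is used for testing exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division into subsets
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

splits = list(k_fold_indices(103))
print(len(splits))                           # 10
print(len(splits[0][0]), len(splits[0][1]))  # 92 11
```

The loss and accuracy obtained on the 10 splits would then be averaged, as stated in the text.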
The effect of the architecture of the CNN on the classification performance
Average pooling vs. max pooling
There are two commonly used pooling filters, average pooling and max pooling. The optimal choice depends on the characteristic of the input feature maps. In other words, the choice of pooling filter is problem-dependent. In order to determine the best pooling filter for the specific application, a comparison test is made; the architecture of the CNN is shown in Fig. 2.
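The difference between the two pooling filters is easy to see on a small example. The 4 × 4 input below is arbitrary, and the helper `pool2x2` (non-overlapping 2 × 2 windows) is a name introduced here for illustration.

```python
def pool2x2(image, reduce):
    """Apply `reduce` to every non-overlapping 2 x 2 window of the image."""
    return [
        [
            reduce([image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1]])
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]

def average(values):
    return sum(values) / len(values)

image = [[0, 0, 10, 12],
         [0, 255, 14, 16],
         [5, 5, 0, 0],
         [5, 5, 0, 8]]
max_pooled = pool2x2(image, max)
avg_pooled = pool2x2(image, average)
print(max_pooled)  # [[255, 16], [5, 8]]
print(avg_pooled)  # [[63.75, 13.0], [5.0, 2.0]]
```

Max pooling keeps the strongest response in each window (the isolated value 255 survives), whereas average pooling smooths it out; which behaviour is preferable depends on the input feature maps, as noted above.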
The comparison results are shown in Fig. 3. In order to clearly outline the variation of the training loss, it is calculated every 100 iterations instead of every epoch. Only one feature map is fed to the CNN in each iteration; consequently, each epoch contains N_train iterations, where N_train is the sample size of the training set. It can be concluded that max pooling performs slightly better than average pooling. First, the final loss when using max pooling is smaller than the one obtained with average pooling, meaning that the difference between the outputs and the targets is smaller when max pooling is used. Second, both the training accuracy and the test accuracy are higher when using max pooling. However, it should be noticed that the fluctuation during the training process is more evident when max pooling is used, as can be seen from the variation in training accuracy.

The performance comparison between the max pooling and average pooling. (a) Loss. (b) Training accuracy and test accuracy.
In summary, average pooling is more stable than max pooling but yields lower accuracy. Max pooling has a higher probability of training a better CNN for the studied problem; consequently, the max pooling filter is used in the following tests.
The convolutional layer is the core of the CNN. In order to investigate the effect of the number of convolutional layers on the accuracy of the results, the number of convolutional layers is varied from 2 to 4.
From the comparison results shown in Fig. 4(a), it can be seen that the loss is largely independent of the number of convolutional layers, and the structure with 3 convolutional layers achieves both the highest training accuracy and the highest test accuracy. It is also found from Fig. 4(b) that when a 2-layer structure is used, the difference between the training accuracy and the test accuracy is smaller than for the other structures. However, the final training accuracy and test accuracy of the 2-layer structure are lower than those of the 3-layer structure, meaning that the 2-layer structure underfits. The situation is the opposite for the 4-layer structure, where the difference between the training accuracy and the test accuracy is significant, which indicates overfitting.

The performance comparison of different convolutional layer architectures. (a) Loss. (b) Training accuracy and test accuracy.
In summary, the 3-layer structure for the convolutional layer is the optimal choice to protect the CNN from both underfitting and overfitting.
The function of the fully connected layer is to generate the mapping relationship between the features extracted by convolutional layers and pooling layers and the targets. In order to determine the optimal structure of the fully connected layer, a comparison test is made. The number of fully connected layers is varied from 1 to 3. The number of neurons is 5 (1-layer), 50-5 (2-layer), and 100-20-5 (3-layer) respectively. The comparison results displayed in Fig. 5 demonstrate that the loss, training accuracy, and test accuracy exhibit minimal differences between 1-layer, 2-layer, and 3-layer structures. Therefore, it can be inferred that the number of layers in the fully connected layer structure has an insignificant impact on the performance of the CNN.

The performance comparison of different fully connected layer architectures. (a) Loss. (b) Training accuracy and test accuracy.
In summary, considering the stability and the computational cost of the CNN, a single fully connected layer is chosen.
In the above tests, the training set and the test set are collected from the power supply company and the public literature. In order to validate the generalization potential of the proposed methodology, the test set is replaced by IEC TC 10 transformer data set [36]. The sample size distribution is shown in Table 3.
Sample size of each category in IEC TC 10 database
The architecture of the CNN is based on the investigation made in the above subsection. Specifically, a 3-layer convolutional structure connected by max pooling layers and a single fully connected layer are used. The test results of the proposed methodology are shown in Fig. 6.
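A CNN of this shape can be sketched in PyTorch as follows. Only the number of convolutional layers (3), the use of max pooling, the single fully connected layer with 5 outputs, the 30 × 30 input size, the cross entropy loss, the Adam optimizer and the learning rate of 0.001 come from the text; the channel counts, kernel size, padding, ReLU activations and the class name `DgaCnn` are assumptions, and the input is a random stand-in rather than a real feature map.

```python
import torch
from torch import nn

class DgaCnn(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        # Three conv blocks, each followed by 2x2 max pooling (channel counts assumed).
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 30 -> 15
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 15 -> 7
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 7 -> 3
        )
        # Single fully connected layer mapping the flattened features to 5 fault types.
        self.classifier = nn.Linear(32 * 3 * 3, n_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), start_dim=1))

model = DgaCnn()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# One training iteration on a single random stand-in feature map (batch size 1).
feature_map = torch.rand(1, 1, 30, 30)
target = torch.tensor([2])  # hypothetical class index
optimizer.zero_grad()
loss = criterion(model(feature_map), target)
loss.backward()
optimizer.step()
```

`nn.CrossEntropyLoss` expects raw scores (logits), so no softmax layer is placed after the fully connected layer.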

The performance of the proposed methodology without using the improved SMOTE on public IEC TC 10 database. (a) Loss. (b) Training accuracy and test accuracy. (c) Confusion matrix of training set. (d) Confusion matrix of test set.
From Fig. 6(a) it is evident that the training process converges within 10 epochs and reaches a stable loss. The differences between the training accuracy and the test accuracy reduce as the training process proceeds. The final difference between the training accuracy and the test accuracy is below 3%, meaning that the proposed methodology achieves great generalization. Figure 6(c,d) show the confusion matrices of the training results and the test results. According to the definition of the confusion matrix, rows represent the real labels of samples and columns represent the labels predicted by the CNN. The diagnosis accuracy of each fault type can be calculated based on the confusion matrix, and the result is shown in Table 4.
In order to observe the influence of sample size on classification accuracy, the sample size of each fault type in the original training set is also shown in Table 4. It can be seen that fault types with small sample sizes achieve low diagnosis accuracy (both precision rate and recall rate), especially PD and L-M-O. This phenomenon appears both in the training set and in the test set. In other words, there is some bias in the trained CNN, which makes it focus mainly on the accurate classification of the classes with large sample sizes. In order to remove this bias, the improved SMOTE is applied to generate synthetic samples so that the sample size of each fault type becomes even.
The results obtained by combining the improved SMOTE with the CNN are shown in Fig. 7. Similarly, the training process converges within 10 epochs, and the difference between the training accuracy and the test accuracy is still below 3%. Consequently, the proposed model retains good generalization after the improved SMOTE is introduced. The effect of the improved SMOTE can be assessed by comparing the precision rate and the recall rate before and after it is used. The training accuracies (both the precision rate and the recall rate) of the minorities are improved, meaning that the improved SMOTE is effective in removing the bias of the CNN produced in the training process. However, it should also be noted that the training accuracy of the whole training set may slightly degrade.
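For a classification problem, apart from the accuracy, two other important indices to evaluate the performance of a classifier are the precision rate and the recall rate. In a multi-class problem, with C_{ij} denoting the entry of the confusion matrix in row i (real label) and column j (predicted label), these two indices can be calculated for class i as follows:
\[ P_i = \frac{C_{ii}}{\sum_{j} C_{ji}}, \qquad R_i = \frac{C_{ii}}{\sum_{j} C_{ij}} \]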
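As an illustration, the per-class precision and recall can be read off a confusion matrix whose rows are the real labels and whose columns are the predicted labels (the convention stated for Fig. 6). The 3 × 3 matrix below is made up for the example and does not reproduce any result of the paper; the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(confusion):
    """Rows: real labels; columns: predicted labels. Returns [(precision, recall), ...]."""
    n = len(confusion)
    result = []
    for i in range(n):
        tp = confusion[i][i]
        predicted_i = sum(confusion[r][i] for r in range(n))  # column sum: predicted as i
        actual_i = sum(confusion[i])                          # row sum: really of class i
        precision = tp / predicted_i if predicted_i else 0.0
        recall = tp / actual_i if actual_i else 0.0
        result.append((precision, recall))
    return result

# Hypothetical confusion matrix for three classes.
confusion = [[8, 2, 0],
             [1, 9, 0],
             [1, 1, 8]]
scores = precision_recall(confusion)
print(scores)  # [(0.8, 0.8), (0.75, 0.9), (1.0, 0.8)]
```

The second class shows how the two indices can diverge: its recall is high (0.9) while its precision is lower (0.75) because samples of the other classes are predicted as this class.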
The precision rate and recall rate distribution
P: precision rate; R: recall rate.

The performance of the proposed methodology using the improved SMOTE on the public IEC TC 10 database. (a) Loss. (b) Training accuracy and test accuracy. (c) Confusion matrix of training set. (d) Confusion matrix of test set.
The test accuracies with or without using the improved SMOTE are close. However, the precision rate and the recall rate distribution are different. Usually, for a specific class, only the precision rate or the recall rate can be improved, and the other one will slightly degrade. The effect of the SMOTE on the CNN should be further studied, especially the problem of how to achieve the balance between the precision rate and the recall rate after the SMOTE is used.
The comparisons of the diagnosis results of IEC TC 10 database obtained by different methods
Transformer faults can be roughly categorized into two types: overheating faults and discharge faults. The proposed methodology achieves 100% accuracy in distinguishing between these two types, as shown in Figs. 6(d) and 7(d). However, there may be some prediction errors when evaluating the severity of the faults. For instance, some samples of high temperature overheating may be mislabeled as low-middle temperature overheating. This is because there is no explicit boundary between these types of faults. Similarly, there may be confusion between high energy discharge, low energy discharge, and partial discharge. Nevertheless, the methodology achieves approximately 90% accuracy in evaluating the severity of the faults and 100% accuracy in recognizing the type of fault, which is satisfactory.
A comparison is conducted between the proposed methodology and previous studies. The main advantage of using a CNN is its ability to automatically select effective features. Therefore, the compared studies are previous intelligent transformer fault diagnosis methods that also employ feature selection. All the vital characteristics of the methods proposed in [16,18,19], including the classifier used (SVM) and the adopted optimal feature combination, are retained when conducting the comparison. The compared models are trained using the same data set, and the final test results are shown in Table 5. The proposed methodology outperforms the previous studies, achieving a diagnosis accuracy of 89.4%, while the methods proposed in [16,18,19] achieve 83.0%, 84.0%, and 87.2%, respectively. The comparison results confirm the accuracy of the proposed methodology.
A methodology to diagnose transformer faults based on DGA by using CNN is proposed. Considering the unbalance of the training data set, the improved SMOTE is combined with the data cleaning technique to generate synthetic samples to make the sample size of each fault type even.
The main outcomes of this study are given below:
(1) The architecture of the CNN has an influence on the classification performance. For the CNN used in the study, max pooling has a higher potential to achieve better classification accuracy.
(2) Choosing an appropriate number of convolutional layers can protect the CNN from both underfitting and overfitting. For example, a 3-layer structure is an optimal choice for the proposed model.
(3) The number of fully connected layers has an insignificant effect on the performance of the CNN. The CNN may benefit from choosing a simple structure for the fully connected layer.
(4) The proposed methodology achieves great generalization, with the difference between the training accuracy and the test accuracy below 3%.
(5) The introduction of the improved SMOTE can effectively remove the bias of the CNN caused by the imbalanced training data set. However, the strategy to achieve the balance between the precision rate and the recall rate should be further studied.
