Abstract
Feature extraction is an important preprocessing step in many research areas. For anomaly detection, the purpose of feature extraction lies in not only extracting the most important features hidden in the datasets, but also discriminating different classes of samples. The latter is usually referred to as discriminative ability. The data collected from production systems usually do not follow Gaussian distribution. They may correspond to nonlinear mixture of independent components. In order to cope with non-Gaussian data and implement nonlinear feature extraction, this article proposes a feature extraction algorithm based on Supervised Independent Component Analysis with Kernel (termed SKICA). SKICA first adopts Kernel Principle Component Analysis (KPCA) to whiten the datasets. Further, by virtue of the within-cluster scatter matrix derived from Linear Discriminate Analysis (LDA), SKICA extends Independent Component Analysis (ICA) to supervised situation by introducing within-cluster information into solving independent components. The latter improvement makes SKICA obtain the independent components more beneficial to separating different classes of samples. In order to quantitatively measure discriminative ability of the feature extraction algorithms involved in experiments, this article defines three kinds of average square distance. This article conducts experiments on artificial datasets, Cloud datasets, and KDD Cup datasets to evaluate the effectiveness of SKICA. The experimental results show that SKICA outperforms several popular supervised feature extraction algorithms, including LDA, LDA with kernel (KDA), and supervised ICA (SICA).
Keywords
Introduction
In order to implement faster and better data analysis, a multi-dimensional dataset is usually transformed into a lower dimensional space while reserving the most useful information at the same time. This requirement leads to the advent of a kind of techniques called feature extraction. As an important preprocessing step, feature extraction is widely used in many research areas including anomaly detection ([1]). Each dimensionality of the dataset before and after feature extraction is usually referred to as a (performance) metric and a feature respectively.
Anomaly detection is defined as the problem of finding patterns in data that do not conform to expected behaviors ([1]). These nonconforming patterns are often referred to as anomalies. For anomaly detection, the purpose of feature extraction lies in not only extracting the most important features hidden in data, but also discriminating different classes of samples. The latter is usually referred to as discriminative ability ([2]).
One challenge for feature extraction techniques is that the data collected from production systems usually do not follow Gaussian distribution, as shown in Section 5.1. Therefore, the feature extraction techniques based on Gaussian distribution assumption, e.g., Principle Component Analysis (PCA), Linear Discriminate Analysis (LDA), cannot guarantee excellent performance.
In addition, feature extraction techniques can be classified into linear and nonlinear techniques according to the adopted transformation. One limitation of linear techniques is the assumption of linear or approximately linear structure of data. To deal with nonlinear structure of data, many nonlinear techniques have been proposed in literature, such as manifold learning, neural network, and genetic algorithm. However, these nonlinear techniques are usually difficult to implement or the complexity is high. Therefore, this article only focuses on linear techniques. Despite this, through introducing kernel method and choosing kernel functions, linear techniques may also implement nonlinear feature extraction and deal with datasets with nonlinear structure, which is confirmed in the experiments (Section 5).
In order to cope with the challenges including non-Gaussian data and nonlinear feature extraction, this article proposes a new feature extraction algorithm for anomaly detection, which is based on Supervised Independent Component Analysis with Kernel (termed SKICA). The first key step of SKICA is whitening the datasets in input Euclidean space (i.e., R
n
) by using Kernel Principle Component Analysis (KPCA), which is equivalent to whitening the mapped datasets in Hilbert space
This article introduces three kinds of average square distance among samples after feature extraction to quantitatively measure discriminative ability of the feature extraction algorithms involved in experiments. In order to evaluate the effectiveness of SKICA, this article conducts experiments on three kinds of datasets, i.e., artificial datasets, Cloud datasets, and KDD Cup datasets. The experimental results show that SKICA outperforms several popular supervised feature extraction algorithms including LDA, LDA with kernel (KDA), and supervised ICA (SICA).
The main contributions of this article are listed as follows:
It proposes a feature extraction algorithm based on supervised independent component analysis with kernel (i.e., SKICA); It introduces quantitative criteria to measure the discriminative ability among all classes of samples; It conducts experiments on multiple datasets to testify the effectiveness of SKICA.
The remainder of this article is organized as follows. Section 2 summarizes related work. Section 3 gives preliminaries. Section 4 presents the proposed SKICA in detail. Section 5 conducts experiments. Section 6 concludes this article and looks into future work.
Related work
Finding a more suitable representation of multivariate data is a long-standing problem in such areas as statistics, pattern recognition, data mining (DM), and machine learning. Representation means transforming the data so that its essential structure is made more visible or the transformed data are easier to understand than original data ([3]). Feature extraction techniques ([4]) transform the datasets from an n-dimensional space to an s-dimensional space (s is usually smaller than n) and reserve the most useful information at the same time.
PCA ([5, 6]) is a classical linear feature extraction technique, which converts a set of correlated metrics into another set of uncorrelated components (i.e., features) by using an orthogonal transform. LDA ([7, 8]) searches for the vectors maximizing Fisher criterion function as the best projection directions and obtains maximum between-cluster scatter and minimum within-cluster scatter. These linear techniques ([5–9]) are based on Gaussian distribution assumption. Unfortunately, this assumption usually does not satisfy in practice. Therefore, these techniques are insufficient to extract optimal features from datasets with arbitrary probability distribution.
Stuhlsatz et al. ([10]) propose a generalized discriminant analysis (GerDA). GerDA uses nonlinear transformation learnt by deep neural networks (DNNs) in a semi-supervised fashion. Liu et al. ([11]) propose a one-dimensional continuous wavelet transform (CWT) to extract the features of colorectal cancer data. But note that, this article only focuses on linear techniques.
Independent Component Analysis (ICA) is a new and powerful data analysis method. ICA can find underlying factors or components from multivariate statistical data. These components may correspond to some physical causes that are involved in the process of generating the data. The essential difference between ICA and PCA is that ICA finds both statistically independent and non-Gaussian components in multivariate data. Heranlt and Jutten ([12]) first proposed the elementary idea of ICA in 1986. Subsequently, Comon ([13]) gave a rigorous mathematical definition of ICA in 1994. Although the research history of ICA is not long compared with PCA and LDA, ICA receives more and more attention in both theory and application domains. Currently, ICA is widely used in feature extraction ([14–21]), dimensionality reduction ([22]), Blind Source Separation (BSS), fault diagnosis ([23]), pattern recognition ([24, 25]), etc.
For anomaly identification in large-scale distributed systems, Lan et al. ([14]) introduce ICA based feature extraction, which converts the multidimensional performance metric data into a lower dimensional space for quick and better analysis. They compare ICA with PCA for feature extraction. The experimental results show that the adopted cell-based anomaly detection mechanism combined with ICA-based feature extraction can effectively identify faulty nodes with high accuracy and low computation overhead.
Traditional ICA is an unsupervised learning method. Therefore, the obtained independent components are not always useful for classification and recognition. The following several research efforts ([15–19]) in literature have been conducted to extend ICA to supervised situation.
Akaho ([15]) proposes conditionally independent component analysis (CICA) to find a conditionally independent representation of input variables for a target variable through linear transformation. The proposed CICA attempts to maximize the independence among extracted features, and maximize the mutual information between extracted features and a target variable at the same time. The experiments show that CICA is useful for naive Bayes learning and Bayesian networks. However, there is no evidence which shows CICA is useful for anomaly detection.
Kwak and Choi ([16]) append standard ICA with binary class labels to produce a number of features that carry information about the class labels. The proposed feature extraction algorithm based on ICA (termed ICA-FX) is available to a task of feature extraction for classification problems by maximizing the joint mutual information between class labels and new features. ICA-FX is developed for binary classification. More research effort is expected to extend it for multiclass problems.
Takabatake et al. ([17]) extend ICA using category information. The proposed method is implemented in a three-layer neural network (i.e., nonlinear techniques). The category information is included into the objective function to guide the learning of the neural network. The experimental results show that the proposed supervised ICA obtains higher recognition accuracy as compared with traditional ICA.
To solve high dimension of input data in face image recognition, Wang ([18]) proposes an integrated method. First, supervised manifold learning is adopted to reduce the dimension fo the input data, and then kernel ICA is adopted to extract the independent components. The kernel ICA aims at maximizing independence and minimizing kernel correlation. In contrast, the purpose of introducing kernel method into SKICA presented in this article is to implement nonlinear feature extraction.
Yamazaki and Fels ([19]) propose a local image descriptor based on a kind of supervised kernel ICA that uses a non-linear kernel approach combined with supervised ICA. They enhance class separability by maximizing the Mahalanobis distance between classes. In contrast, SKICA presented in this article introduces within-cluster information (
In addition, many linear techniques can be extended to nonlinear situation by kernel method. When the input space is R
n
, and the feature space is a Hilbert space
Preliminaries
Notations:
An italic lowercase letter represents a scalar value, e.g., x. A bold and italic lowercase letter represents a vector, e.g.,
A bold and italic uppercase letter represents a matrix, e.g.,
An italic uppercase letter represents a random variable, e.g., X. A bold and italic uppercase letter represents a random vector, e.g.,
Assume that, the original dataset consists of n metrics; each metric is considered as a random variable, X
i
, i = 1, 2, …, n; the sampling values of X
i
are denoted as X
i
(t), i = 1, …, n, t = 1, …, l. Also assume that these n metrics constitute an n-dimensional random vector,
The problem of representing multivariate data can be intuitively stated as follows: searching a transformation from an n-dimensional space to an s-dimensional space such that the transformed variables (i.e., the extracted features) give information hidden in the original dataset.
Suppose that each of the extracted features, Y
i
, can be expressed as a linear combination of these n metrics:
where w
ij
are coefficients in the transformation. The above transformation can be expressed as the following matrix multiplication:
The target of ICA is to determine the s-by-n transformation matrix
Further, assume that the row vectors (also in the format of column vectors) of
The target of ICA is to solve
A principle for determining
An important preprocessing step in ICA is whitening, which can be implemented by PCA. Whitening can be formalized as: given an n-dimensional random vector,
Suppose that: the covariance matrix of
The key of ICA lies in how to represent statistical independence. According to the central limit theorem, the sum of a set of independent random variables tends to follow Gaussian distribution. Therefore, an intuitive principle of ICA is to maximize non-Gaussianity. The basic idea is: taking a linear combination of the observed sampling values, y =
Therefore, the principle of ICA is: given the whitened random vector
Non-Gaussianity can be measured by negentropy. For a continuous random variable Y, its differential entropy is defined as:
where p
Y
(y) is its probability density function. A fundamental result of information theory is that a Gaussian variable has the largest entropy among all random variables of equal variance. Negentropy of Y is defined as follows:
where Y gauss is a Gaussian random variable of the same covariance as Y.
In practice, negentropy is approximated as:
where υ is the standard Gaussian random variable, G is a non-quadratic function. Usually, G can be chosen as (g is the derivative function of G):
Therefore, the target of ICA is converted as maximizing the negentropy of y =
where s is the number of solved independent components; δ kj is the Kronecker function, i.e., δ kj = 1 when k = j, δ kj = 0 when k ≠ j.
The above optimization problem can be solved by the Gradient algorithm or fixed-point algorithms. FastICA ([28]) is a fast converged ICA algorithm, which is based on a fixed-point iteration scheme maximizing non-Gaussianity. The iterative formula of FastICA is:
where g’(u) is the derivative function of g.
When ICA methods are used for feature extraction, the solved independent components are taken as the extracted features, which correspond to the physical cause generating the observed data. The concrete steps of the feature extraction algorithm based on FastICA ([28]) are listed as follows.
where g is defined, e.g., as in Equations. 9–11.
Compared with PCA, ICA has the following potential advantages:
ICA provides a more suitable probability model, which can better represent data in n-dimensional space; ICA uniquely determines the transformation matrix
The solved independent components in ICA do not need to be orthogonal; ICA uses higher-order statistics, which contain more information than covariance (i.e., second-order statistics); ICA assumes that the underlying independent components must have non-Gaussian distributions, which is more practical in real situations.
Despite the above advantages, ICA has the following two insufficiencies. First of all, ICA is an unsupervised learning method. It does not take full advantage of label or category information of samples. Second, ICA assumes that the observed data is the linear mixture of independent components. Therefore, ICA cannot separate nonlinearly mixed observed data. However, the collected observed data from production systems are usually not the simply linear mixture of independent components.
This article first extends ICA by introducing kernel method so as to make ICA can separate nonlinear mixed observed data. This article further extends ICA to supervised situation by introducing category information into the process of solving independent components. The solved independent components are more beneficial for anomaly detection. Namely, this article takes full advantage of category information to improve ICA. Based on these two improvements, this article proposes a feature extraction algorithm based on supervised independent component analysis with kernel (termed SKICA).
SKICA contains two key steps, which are detailed in Section 4.1 and 4.2 respectively.
Introducing kernel method into ICA – whitening data using KPCA
For the observed data of non-linear structure in R
n
space (input space), the purpose of introducing kernel method into ICA is to map the observed data from R
n
to a Hilbert space (output space or feature space)
While,
is the kernel function associated with the map Φ.
As stated in Section 3, an important preprocessing step in ICA is whitening, which can be implemented by PCA. Note that, performing PCA based on the mapped data in feature space
Given l samples,
The covariance operator of these mapped samples in the feature space
If
Let
Calculate the former s largest positive eigenvalues of
Let
Let
then
Thus, the whitening matrix
The above transformation can be specifically expressed by:
where
Now consider the problem of centering data. It is easy to center data in input space R
n
. But it is difficult to do so in feature space
Replace
In order to make full use of the category information of samples, ICA needs to be extended to supervised situation. This article introduces the following ideas in LDA to extend ICA.
Assuming that the l samples can be divided into c classes,
where
are within-cluster scatter matrix and mean of the i-th class respectively.
bfitS
W
represents the aggregation degree of these l samples. The object of ICA is to solve vectors
where ɛ is a predefined threshold. The following iterative formula is obtained through solving Equation 32 by FastICA.
where r p is a positive scalar parameter, which is used to control the degree of introduced within-cluster information. If r p = 0, Equation 33 is degraded to the traditional ICA method.
The first key step in SKICA is to whiten the sample data in R
n
by using KPCA. This key step maps the sample data into feature space
The concrete steps of SKICA are listed as follows.
where g is defined, e.g., as in Equations 9–11.
Experiments and analyses
Datasets
This article conduct experiments on the following three kinds of datasets to evaluate the effectiveness of SKICA.
(1) Artificial datasets
Artificial datasets are widely used for algorithm test and evaluation. The underlying reason lies in the possibility to generate random datasets which follow required probability distribution. Narasimhamurthy and Kuncheva present a framework for generating data to simulate changing environments ([29]). They provide several artificial datasets ([30]). All of these datasets have a certain degree of nonlinear structure and therefore bring great challenges for linear feature extraction algorithms involved in experiments. Their scatter diagrams of these datasets are shown in Fig. 1(a)-(h).
(2) Cloud datasets

8 artificial datasets.
As stated in Section 1, the data collected from production systems usually do not follow Gaussian distribution. The heath sates (normal or abnormal) of virtual machines (VMs) in production Cloud platforms can be characterized by performance metrics. This article first constructs an institute-wide Cloud platform and collects a set of 53 performance metrics for each VM. These performance metrics can be classified into five categories: computation, storage, disk I/O, process, and network.
This article testifies the probability distribution of the collected performance metrics. First, all metrics of a random selected VM are sampled every 5 seconds lasting 2 hours. Therefore, 1440 sampled values are obtained. Then eight performance metrics are optionally chosen, and their histograms are plotted, as shown in Fig. 2. From these histograms, it is verified that the performance metrics data of VMs under Cloud environment usually do not follow the Gaussian distribution.

The histograms of eight optionally chosen performance metrics.
In order to detect abnormal states of VMs, an anomaly detection algorithm, e.g., Support Vector Machine (SVM, [31, 32]), trains a detection model based on a training sample set represented by the extracted features. The detection model then determines a new sample as normal or abnormal.
(3) KDD Cup dataset
KDD Cup 1999 dataset ([33]) drives from a dataset which is simulated and collected in a military network environment ([34]). Therefore, KDD Cup dataset is a real dataset. KDD Cup dataset is released by the fifth International Conference on Knowledge Discovery and Data Mining (KDD). Each sample characterizes a connection record between a source host and a destination host during a complete session. Each sample contains 41 features, excluding the label of attack category. KDD Cup dataset contains a training dataset (4,898,431 samples totally), which is collected from 7 weeks of network traffic. The dataset also contains a test dataset (2,984,154 samples totally), which is collected from 2 weeks of network traffic. KDD Cup dataset contains a wide variety of intrusions including Probe, DOS, U2R, and R2L. Therefore, it is widely used for the test and evaluation of algorithms in intrusion detection, anomaly detection, data mining, etc.
This article evaluates SKICA compared with the other three supervised feature extraction algorithms from the following two aspects. These three algorithms include traditional LDA ([7, 8]), LDA with kernel (KDA, [27]), and supervised ICA with category information (SICA, [17]).
(1) Discriminative ability test
As stated in Section 1, one purpose of feature extraction in anomaly detection is to discriminate different classes of samples. This ability is usually referred to as discriminative ability. In order to quantitatively measure discriminative ability of the feature extraction algorithms involved in experiments, this article defines the following three square distances.
d (
d
d
For the sample matrix
d
At last, d
B
(X) is computed as:
For the sample matrix
For a given feature extraction algorithm and a given dataset, if the number of extracted features is specified, the smaller the d
W
(
(2) Detection accuracy test
Another purpose of feature extraction in anomaly detection is to extract the most important and effective features for representing data. Therefore, as an important preprocessing step, feature extraction is expected to enhance the accuracy of anomaly detection algorithms. This article combines the involved feature extraction algorithms with a same anomaly detection algorithm (standard SVM, i.e., C-SVM, [31, 32]) and tests the detection accuracy of the combined anomaly detection mechanisms on Cloud dataset. This article adopts the following two accuracy measures. (F
P
, False Positive; F
N
, False Negative; T
P
, True Positive; T
N
, True Negative)
Sensitivity: the proportion of True Positive (i.e., correctly detected abnormal states of VMs) to the number of actual abnormal states.
Specificity: the proportion of True Negative (i.e., correctly detected normal states of VMs) to the number of actual normal states.
This article conducts the following two sets of experiments to evaluate the effectiveness of the proposed SKICA.
(1) Discriminative ability experiments
The results of four supervised feature extraction algorithms (i.e., LDA, KDA, SICA, and the proposed SKICA) on 8 artificial datasets are listed in Table 1. For each dataset, only one feature is extracted. Each cell in the latter four columns of Table 1 contains two values, the former is d
W
(
Experimental results of 4 supervised feature extraction algorithms on 8 artificial datasets
Experimental results of 4 supervised feature extraction algorithms on 8 artificial datasets
The experimental results show that the discriminative ability of LDA in terms of d
W
(
Most importantly, the proposed SKICA outperforms the other three algorithms on all the 8 artificial datasets. Specifically, the d
W
(
Table 2 lists the experimental results of these 4 supervised feature extraction algorithms on 2 Cloud datasets. The number of extracted features is specified as 16. Therefore, the dimensionality reduction rates of these four algorithms are (53 –16)/53 = 69.81%. The results show that the proposed SKICA still evidently outperforms the other three algorithms on non-Gaussian Cloud datasets. Specially, the d
W
(
Experimental results of 4 supervised feature extraction algorithms on 2 Cloud datasets
Table 3 lists the experimental results of these 4 supervised algorithms on 2 KDD Cup datasets. The number of extracted features is specified as 12. Therefore, the dimensionality reduction rates of these four algorithms are (41 –12)/41 = 70.73%. The results show that the proposed SKICA still outperforms the other three supervised algorithms on KDD Cup datasets in terms of d
W
(
Experimental results of 4 supervised feature extraction algorithms on 2 KDD Cup datasets
(2) Detection accuracy experiments
In this set of experiments, all of these four feature extraction algorithms extract 16 features from 53 performance metrics of Cloud datasets. First, a training sample set containing 2000 samples is constructed. Four combined anomaly detection mechanisms (i.e., LDA pulus C-SVM, KDA pulus C-SVM, SICA pulus C-SVM, SKICA pulus C-SVM) train the detection models on this training sample set respectively. A test sample set containing another 2000 samples is constructed at the same time. The accuracy measures of these detection models on this test sample set are calculated and listed in Table 4.
Usually, there exists much correlation and redundancy among the original performance metrics. Some metrics may even be mixed with measurement noise since they are collected from production systems in real time. Feature extraction algorithms can reduce or even eliminate the correlation and redundancy among features. The experimental results in Table 4 show that SKICA plus C-SVM achieves the highest performance measures in terms of both Sensitivity and Specificity.
Performance measure results of four anomaly detection mechanisms on Cloud datasets
In sum, SKICA improves ICA from two aspects, i.e., solving nonlinear mixture of observed data and introducing category information. From the above two sets of experiments, it is concluded that theproposed SKICA not only enlarge the distance among different classes of samples, but also extract the most effective features for anomaly detection. Therefore, SKICA is more applicable than the other three algorithms for anomaly detection under supervised situation.
Feature extraction techniques for anomaly detection face the challenges including non-Gaussian data and nonlinear feature extraction. To solve these challenges, this article proposes a new feature extraction algorithm termed SKICA. SKICA inherits the advantage of ICA, i.e., the solved components or features are mutually independent. Moreover, SKICA implements nonlinear feature extraction, and enlarges the distance among different classes of samples, which avails subsequent anomaly detection.
During experiments, this article adopts the standard SVM (i.e., C-SVM) as the anomaly detection algorithm. In future, this article will consider more powerful anomaly detection algorithms for different application domains. The effectiveness of the detection mechanisms (i.e., SKICA combined with powerful detection algorithms) will be testified in future.
Footnotes
Acknowledgments
The authors are grateful to the editor and anonymous reviewers for their valuable comments on this article.
The work of this article is supported by National Natural Science Foundation of China (Grant No. 61272399 and No. 61572090), Chongqing Research Program of Basic Research and Frontier Technology (Grant No. cstc2015jcyjBX0014, and No. cstc2016jcyjA0304), and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (No. KJ1600521).
