SKICA: A feature extraction algorithm based on supervised ICA with kernel for anomaly detection

Abstract

Feature extraction is an important preprocessing step in many research areas. For anomaly detection, the purpose of feature extraction lies not only in extracting the most important features hidden in the datasets, but also in discriminating different classes of samples. The latter is usually referred to as discriminative ability. The data collected from production systems usually do not follow a Gaussian distribution, and they may correspond to a nonlinear mixture of independent components. In order to cope with non-Gaussian data and implement nonlinear feature extraction, this article proposes a feature extraction algorithm based on Supervised Independent Component Analysis with Kernel (termed SKICA). SKICA first adopts Kernel Principal Component Analysis (KPCA) to whiten the datasets. Further, by virtue of the within-cluster scatter matrix derived from Linear Discriminant Analysis (LDA), SKICA extends Independent Component Analysis (ICA) to the supervised situation by introducing within-cluster information into solving independent components. This improvement makes the independent components obtained by SKICA more beneficial to separating different classes of samples. In order to quantitatively measure the discriminative ability of the feature extraction algorithms involved in the experiments, this article defines three kinds of average square distance. This article conducts experiments on artificial datasets, Cloud datasets, and KDD Cup datasets to evaluate the effectiveness of SKICA. The experimental results show that SKICA outperforms several popular supervised feature extraction algorithms, including LDA, LDA with kernel (KDA), and supervised ICA (SICA).

Keywords

Feature extraction; Anomaly detection; Independent component analysis (ICA); Supervised; Kernel method

1 Introduction

In order to implement faster and better data analysis, a multi-dimensional dataset is usually transformed into a lower-dimensional space while preserving the most useful information. This requirement has led to a family of techniques called feature extraction. As an important preprocessing step, feature extraction is widely used in many research areas including anomaly detection ([1]). Each dimension of the dataset is usually referred to as a (performance) metric before feature extraction and as a feature after it.

Anomaly detection is defined as the problem of finding patterns in data that do not conform to expected behaviors ([1]). These nonconforming patterns are often referred to as anomalies. For anomaly detection, the purpose of feature extraction lies in not only extracting the most important features hidden in data, but also discriminating different classes of samples. The latter is usually referred to as discriminative ability ([2]).

One challenge for feature extraction techniques is that the data collected from production systems usually do not follow a Gaussian distribution, as shown in Section 5.1. Therefore, feature extraction techniques based on the Gaussian distribution assumption, e.g., Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), cannot guarantee excellent performance.

In addition, feature extraction techniques can be classified into linear and nonlinear techniques according to the adopted transformation. One limitation of linear techniques is the assumption that the data have linear or approximately linear structure. To deal with nonlinear structure, many nonlinear techniques have been proposed in the literature, such as manifold learning, neural networks, and genetic algorithms. However, these nonlinear techniques are usually difficult to implement or have high complexity. Therefore, this article only focuses on linear techniques. Nevertheless, by introducing the kernel method and choosing suitable kernel functions, linear techniques can also implement nonlinear feature extraction and deal with datasets with nonlinear structure, which is confirmed in the experiments (Section 5).

In order to cope with the challenges of non-Gaussian data and nonlinear feature extraction, this article proposes a new feature extraction algorithm for anomaly detection, which is based on Supervised Independent Component Analysis with Kernel (termed SKICA). The first key step of SKICA is whitening the datasets in the input Euclidean space (i.e., Rⁿ) by using Kernel Principal Component Analysis (KPCA), which is equivalent to whitening the mapped datasets in the Hilbert space $H$ using PCA. The second key step of SKICA is extending Independent Component Analysis (ICA) to the supervised situation by introducing within-cluster information derived from LDA into solving independent components. This improvement makes the independent components obtained by SKICA more beneficial to separating different classes of samples.

This article introduces three kinds of average square distance among samples after feature extraction to quantitatively measure discriminative ability of the feature extraction algorithms involved in experiments. In order to evaluate the effectiveness of SKICA, this article conducts experiments on three kinds of datasets, i.e., artificial datasets, Cloud datasets, and KDD Cup datasets. The experimental results show that SKICA outperforms several popular supervised feature extraction algorithms including LDA, LDA with kernel (KDA), and supervised ICA (SICA).

The main contributions of this article are listed as follows:

It proposes a feature extraction algorithm based on supervised independent component analysis with kernel (i.e., SKICA);

It introduces quantitative criteria to measure the discriminative ability among all classes of samples;

It conducts experiments on multiple datasets to verify the effectiveness of SKICA.

The remainder of this article is organized as follows. Section 2 summarizes related work. Section 3 gives preliminaries. Section 4 presents the proposed SKICA in detail. Section 5 conducts experiments. Section 6 concludes this article and looks into future work.

2 Related work

Finding a more suitable representation of multivariate data is a long-standing problem in such areas as statistics, pattern recognition, data mining (DM), and machine learning. Representation means transforming the data so that its essential structure is made more visible or the transformed data are easier to understand than the original data ([3]). Feature extraction techniques ([4]) transform the datasets from an n-dimensional space to an s-dimensional space (s is usually smaller than n) while preserving the most useful information.

PCA ([5, 6]) is a classical linear feature extraction technique, which converts a set of correlated metrics into another set of uncorrelated components (i.e., features) by using an orthogonal transform. LDA ([7, 8]) searches for the vectors maximizing the Fisher criterion function as the best projection directions and obtains maximum between-cluster scatter and minimum within-cluster scatter. These linear techniques ([5–9]) are based on the Gaussian distribution assumption. Unfortunately, this assumption is usually not satisfied in practice. Therefore, these techniques are insufficient to extract optimal features from datasets with arbitrary probability distributions.

Stuhlsatz et al. ([10]) propose a generalized discriminant analysis (GerDA). GerDA uses nonlinear transformation learnt by deep neural networks (DNNs) in a semi-supervised fashion. Liu et al. ([11]) propose a one-dimensional continuous wavelet transform (CWT) to extract the features of colorectal cancer data. But note that, this article only focuses on linear techniques.

Independent Component Analysis (ICA) is a new and powerful data analysis method. ICA can find underlying factors or components from multivariate statistical data. These components may correspond to some physical causes that are involved in the process of generating the data. The essential difference between ICA and PCA is that ICA finds components in multivariate data that are both statistically independent and non-Gaussian. Herault and Jutten ([12]) first proposed the elementary idea of ICA in 1986. Subsequently, Comon ([13]) gave a rigorous mathematical definition of ICA in 1994. Although the research history of ICA is not long compared with PCA and LDA, ICA receives more and more attention in both theory and application domains. Currently, ICA is widely used in feature extraction ([14–21]), dimensionality reduction ([22]), Blind Source Separation (BSS), fault diagnosis ([23]), pattern recognition ([24, 25]), etc.

For anomaly identification in large-scale distributed systems, Lan et al. ([14]) introduce ICA based feature extraction, which converts the multidimensional performance metric data into a lower dimensional space for quick and better analysis. They compare ICA with PCA for feature extraction. The experimental results show that the adopted cell-based anomaly detection mechanism combined with ICA-based feature extraction can effectively identify faulty nodes with high accuracy and low computation overhead.

Traditional ICA is an unsupervised learning method. Therefore, the obtained independent components are not always useful for classification and recognition. Several research efforts ([15–19]) in the literature have extended ICA to the supervised situation.

Akaho ([15]) proposes conditionally independent component analysis (CICA) to find a conditionally independent representation of input variables for a target variable through linear transformation. The proposed CICA attempts to maximize the independence among extracted features, and maximize the mutual information between extracted features and a target variable at the same time. The experiments show that CICA is useful for naive Bayes learning and Bayesian networks. However, there is no evidence which shows CICA is useful for anomaly detection.

Kwak and Choi ([16]) append standard ICA with binary class labels to produce a number of features that carry information about the class labels. The proposed feature extraction algorithm based on ICA (termed ICA-FX) is available to a task of feature extraction for classification problems by maximizing the joint mutual information between class labels and new features. ICA-FX is developed for binary classification. More research effort is expected to extend it for multiclass problems.

Takabatake et al. ([17]) extend ICA using category information. The proposed method is implemented in a three-layer neural network (i.e., nonlinear techniques). The category information is included into the objective function to guide the learning of the neural network. The experimental results show that the proposed supervised ICA obtains higher recognition accuracy as compared with traditional ICA.

To handle the high dimension of input data in face image recognition, Wang ([18]) proposes an integrated method. First, supervised manifold learning is adopted to reduce the dimension of the input data, and then kernel ICA is adopted to extract the independent components. The kernel ICA aims at maximizing independence and minimizing kernel correlation. In contrast, the purpose of introducing the kernel method into SKICA presented in this article is to implement nonlinear feature extraction.

Yamazaki and Fels ([19]) propose a local image descriptor based on a kind of supervised kernel ICA that uses a non-linear kernel approach combined with supervised ICA. They enhance class separability by maximizing the Mahalanobis distance between classes. In contrast, SKICA presented in this article introduces within-cluster information (the term $w^{T} S_{W} w$ in Equation 32) into solving independent components, so as to obtain independent components more beneficial to anomaly detection.

In addition, many linear techniques can be extended to nonlinear situation by kernel method. When the input space is Rⁿ, and the feature space is a Hilbert space $H$ , an input vector in Rⁿ is transformed by an implicit map Φ into an output vector in the Hilbert space. The kernel function associated with Φ represents the inner product between output vectors. Implicitly linear processing in a high dimensional Hilbert space using the kernel function is equivalent to nonlinear processing in Rⁿ. This method is called kernel method, which is a very common and general machine learning method. Scholkopf et al. ([26]) first introduce kernel method into PCA and propose kernel PCA (KPCA). Mika et al. ([27]) introduce kernel method into Fisher discriminant analysis and propose kernel based Fisher discriminant analysis (KDA).

3 Preliminaries

Notations:

An italic lowercase letter represents a scalar value, e.g., x.

A bold and italic lowercase letter represents a vector, e.g., x ; in addition, a vector in Hilbert space $H$ is represented by a bold lowercase letter, e.g., x. Further, all the vectors in this article are column ones.

A bold and italic uppercase letter represents a matrix, e.g., X ; while a matrix in Hilbert space $H$ is represented by a bold uppercase letter, e.g., X.

An italic uppercase letter represents a random variable, e.g., X.

A bold and italic uppercase letter represents a random vector, e.g., X .

Assume that, the original dataset consists of n metrics; each metric is considered as a random variable, X_i, i = 1, 2, …, n; the sampling values of X_i are denoted as X_i (t), i = 1, …, n, t = 1, …, l. Also assume that these n metrics constitute an n-dimensional random vector, X , whose sampling values are denoted as x ₁, x ₂, …, x _l.

The problem of representing multivariate data can be intuitively stated as follows: searching a transformation from an n-dimensional space to an s-dimensional space such that the transformed variables (i.e., the extracted features) give information hidden in the original dataset.

Suppose that each of the extracted features, Y_i, can be expressed as a linear combination of these n metrics: $Y_{i} (t) = \sum_{j} w_{ij} X_{j} (t), i = 1, \dots, s, j = 1, \dots, n,$ (1)

where w_ij are coefficients in the transformation. The above transformation can be expressed as the following matrix multiplication: $\begin{bmatrix} Y_{1} (t) \\ Y_{2} (t) \\ \vdots \\ Y_{s} (t) \end{bmatrix} = W \begin{bmatrix} X_{1} (t) \\ X_{2} (t) \\ \vdots \\ X_{n} (t) \end{bmatrix} .$ (2)

The target of ICA is to determine the s-by-n transformation matrix W only according to the statistical properties of the transformed components Y_i.

Further, assume that the row vectors (also in the format of column vectors) of W are w ₁, w ₂, …, w _s, then $W = {[w_{1}, w_{2} \dots w_{s}]}^{T}$ (3)

The target of ICA is to solve w ₁, w ₂, …, w _s.

A principle for determining W is statistical independence: the components Y_i should be mutually independent. This means that Y_i may correspond to some physical causes generating the data and Y_i are mutually independent during the generation process.

An important preprocessing step in ICA is whitening, which can be implemented by PCA. Whitening can be formalized as: given an n-dimensional random vector, X , searching for a linear transformation V such that the elements Z_i in the transformed random vector Z are uncorrelated and have unit variances. $Z = VX .$ (4)

Suppose that: the covariance matrix of X is C _X; the eigenvectors of C _X are ν ₁, ν ₂, …, ν _n; the associated eigenvalues are λ₁, λ₂, …, λ_n. Let E = [ ν ₁, ν ₂ … ν _n], D = diag (λ₁, λ₂, …, λ_n). Then a linear whitening transform is given by: $V = D^{- 1 / 2} E^{T} .$ (5)
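The whitening transform of Equation 5 can be illustrated with a few lines of NumPy. The following is a minimal sketch, assuming that the covariance matrix has full rank (all eigenvalues positive) and that samples are stored as columns; the function name `pca_whiten` and the toy mixing matrix are hypothetical and serve only the illustration.

```python
import numpy as np

def pca_whiten(X):
    """Whiten an n-by-l data matrix X (one sample per column).

    Returns the whitened data Z = V X_c and the whitening matrix
    V = D^{-1/2} E^T of Equation 5 (assumes a full-rank covariance).
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center each metric
    C = Xc @ Xc.T / Xc.shape[1]              # covariance matrix C_X
    eigvals, E = np.linalg.eigh(C)           # C_X is symmetric
    V = np.diag(eigvals ** -0.5) @ E.T       # Equation 5
    return V @ Xc, V

# Toy data: two correlated, non-Gaussian metrics (hypothetical mixing matrix).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.3, 1.0]])
X = A @ rng.laplace(size=(2, 5000))
Z, V = pca_whiten(X)
cov_Z = Z @ Z.T / Z.shape[1]                 # should be close to the identity
```

After whitening, the sample covariance of Z is the identity matrix up to rounding error, which is the property required by the constraint in Equation 12.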

The key of ICA lies in how to represent statistical independence. According to the central limit theorem, the sum of a set of independent random variables tends to follow a Gaussian distribution. Therefore, an intuitive principle of ICA is to maximize non-Gaussianity. The basic idea is: taking a linear combination of the observed sampling values, y = w ^T x , if y equals one of the independent components, its non-Gaussianity achieves a maximum value. This is because, if y is a mixture of two or more independent components, it is closer to a Gaussian distribution according to the central limit theorem.

Therefore, the principle of ICA is: given the whitened random vector Z and its sample value z ₁, z ₂, …, z _l, the task is to solve the vectors w (under the constraint ∥ w ∥ = 1) so as to make the non-Gaussianity of y = w ^T z achieve a maximum value.

Non-Gaussianity can be measured by negentropy. For a continuous random variable Y, its differential entropy is defined as: $H (Y) = - \int p_{Y} (y) log (p_{Y} (y)) dy,$ (6)

where p_Y(y) is its probability density function. A fundamental result of information theory is that a Gaussian variable has the largest entropy among all random variables of equal variance. Negentropy of Y is defined as follows: $J (Y) = H (Y_{gauss}) - H (Y),$ (7)

where Y_gauss is a Gaussian random variable of the same covariance as Y.

In practice, negentropy is approximated as: $J (Y) \propto [E \{G (Y)\} - E \{G (υ)\}]^{2},$ (8)

where υ is the standard Gaussian random variable, G is a non-quadratic function. Usually, G can be chosen as (g is the derivative function of G): $\begin{matrix} G_{1} (u) = \frac{1}{a_{1}} log cosh (a_{1} u) \\ g_{1} (u) = tanh (a_{1} u) \end{matrix},$ (9)

$\begin{matrix} G_{2} (u) = - \frac{1}{a_{2}} exp (- a_{2} u^{2} / 2) \\ g_{2} (u) = u exp (- a_{2} u^{2} / 2) \end{matrix},$ (10)

$G_{3} (u) = \frac{1}{4} u^{4}, g_{3} (u) = u^{3} .$ (11)
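To make Equation 8 concrete, the sketch below evaluates the approximation with G₁ (a₁ = 1) on Gaussian and Laplace samples. It is an illustration under stated assumptions: expectations are replaced by sample means, the Gaussian reference term E{G(υ)} is estimated by Monte Carlo, and the helper name `negentropy_approx` and the sample sizes are hypothetical.

```python
import numpy as np

def G1(u, a1=1.0):
    """Contrast function G1 of Equation 9."""
    return np.log(np.cosh(a1 * u)) / a1

def g1(u, a1=1.0):
    """Derivative of G1 (Equation 9)."""
    return np.tanh(a1 * u)

def negentropy_approx(y, n_ref=200_000, seed=0):
    """Approximate J(Y) up to a constant factor via Equation 8 with G = G1."""
    y = (y - y.mean()) / y.std()                            # zero mean, unit variance
    nu = np.random.default_rng(seed).standard_normal(n_ref)  # standard Gaussian reference
    return (G1(y).mean() - G1(nu).mean()) ** 2

rng = np.random.default_rng(1)
j_gauss = negentropy_approx(rng.standard_normal(100_000))   # close to zero
j_laplace = negentropy_approx(rng.laplace(size=100_000))    # clearly larger
```

The Gaussian sample yields a value near zero, while the heavier-tailed Laplace sample yields a larger value, which is the behavior that the maximization in Equation 12 exploits.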

Therefore, the target of ICA is converted into maximizing the negentropy of y = w ^T z , which can be further converted into the following optimization problem: $\begin{matrix} \max J (w) = [E \{G (w_{i}^{T} z)\} - E \{G (υ)\}]^{2}, \quad i = 1, 2, \dots, s, \\ \text{s.t.} \quad E \{(w_{k}^{T} z) (w_{j}^{T} z)\} = δ_{kj} \end{matrix}$ (12)

where s is the number of solved independent components; δ_kj is the Kronecker delta, i.e., δ_kj = 1 when k = j, δ_kj = 0 when k ≠ j.

The above optimization problem can be solved by gradient algorithms or fixed-point algorithms. FastICA ([28]) is a fast-converging ICA algorithm, which is based on a fixed-point iteration scheme maximizing non-Gaussianity. The iterative formula of FastICA is: $w \leftarrow E \{z g (w^{T} z)\} - E \{g^{'} (w^{T} z)\} w,$ (13)

where g’(u) is the derivative function of g.

When ICA methods are used for feature extraction, the solved independent components are taken as the extracted features, which correspond to the physical cause generating the observed data. The concrete steps of the feature extraction algorithm based on FastICA ([28]) are listed as follows.

Algorithm 1. The feature extraction algorithm based on FastICA.

Input: l unlabeled n-dimensional samples, x ₁, x ₂, …, x _l; the number of extracted features, s.

Output: The transformation matrix, W _s×n; the sample matrix after feature extraction, Y _s×l.

Step 1. Organize the samples into an n-by-l data matrix X _n×l = [ x ₁, x ₂ ⋯ x _l]; Center X _n×l to make its mean zero; The centered matrix is also denoted as X _n×l.

Step 2. Whiten the centered X _n×l to give a new white data matrix, which is denoted as Z _n×l.

Step 3. Let the number of estimated independent components equal to s.

Step 4. Choose a random initial matrix W = [ w ₁ w ₂ … w _s] ^T, where ∥ w _i ∥ = 1. Orthogonalize W by Equation 15.

Step 5. For i = 1 to s, update w _i by: $w_{i} \leftarrow E \{z g (w_{i}^{T} z)\} - E \{g^{'} (w_{i}^{T} z)\} w_{i},$ (14)

where g is defined, e.g., as in Equations 9–11.

Step 6. Symmetrically orthogonalize the matrix W by: $W = {({WW}^{T})}^{- 1 / 2} W .$ (15)

Step 7. If $(1 - | w_{i}^{T} w_{i}^{old} |) < δ$ for every i, where $w_{i}^{old}$ is the value of w _i before the current iteration and δ is a small predefined threshold which indicates the end of the iteration, then the algorithm has converged; otherwise, go to Step 5 for the next iteration.

Step 8. Transform the data matrix X _n×l from n-dimensional space to s-dimensional space by the transformation Y _s×l = W _s×n × X _n×l, and obtain Y _s×l.
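The steps of Algorithm 1 can be sketched compactly in NumPy. The code below is a minimal illustration rather than a reference implementation: it fixes g = g₁ with a₁ = 1, replaces expectations by sample means, checks convergence by comparing each row of W with its value from the previous iteration, and applies the resulting W to the whitened data Z. The function names, the toy sources, and the mixing matrix are hypothetical.

```python
import numpy as np

def sym_orth(W):
    """Symmetric orthogonalization of Equation 15: W <- (W W^T)^{-1/2} W."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T @ W

def fastica_symmetric(Z, s, max_iter=200, delta=1e-6, seed=0):
    """Estimate an s-row matrix W from whitened data Z (samples as columns)."""
    rng = np.random.default_rng(seed)
    W = sym_orth(rng.standard_normal((s, Z.shape[0])))     # Step 4
    for _ in range(max_iter):
        Y = W @ Z                                          # rows hold w_i^T z
        gY = np.tanh(Y)                                    # g1 with a1 = 1
        dgY = 1.0 - gY ** 2                                # derivative of g1
        W_new = gY @ Z.T / Z.shape[1] - dgY.mean(axis=1)[:, None] * W  # Step 5
        W_new = sym_orth(W_new)                            # Step 6
        # Step 7: compare each row with its previous value (sign is irrelevant).
        done = np.max(1.0 - np.abs(np.sum(W_new * W, axis=1))) < delta
        W = W_new
        if done:
            break
    return W

# Toy demo: one sub-Gaussian and one super-Gaussian source, linearly mixed.
rng = np.random.default_rng(0)
S = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S
Xc = X - X.mean(axis=1, keepdims=True)                     # Step 1
vals, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
Z = np.diag(vals ** -0.5) @ E.T @ Xc                       # Step 2 (Equation 5)
W = fastica_symmetric(Z, s=2)
Y = W @ Z                                                  # extracted features
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])      # |corr(Y_i, S_j)|
```

In this toy setting each extracted feature should correlate strongly with exactly one source, up to sign and ordering, which are the usual indeterminacies of ICA.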

4 The proposed feature extraction algorithm – SKICA

Compared with PCA, ICA has the following potential advantages:

ICA provides a more suitable probability model, which can better represent data in n-dimensional space;

ICA uniquely determines the transformation matrix W ;

The solved independent components in ICA do not need to be orthogonal;

ICA uses higher-order statistics, which contain more information than covariance (i.e., second-order statistics);

ICA assumes that the underlying independent components must have non-Gaussian distributions, which is more practical in real situations.

Despite the above advantages, ICA has the following two shortcomings. First, ICA is an unsupervised learning method, so it does not take advantage of the label or category information of samples. Second, ICA assumes that the observed data are a linear mixture of independent components. Therefore, ICA cannot separate nonlinearly mixed observed data. However, the observed data collected from production systems are usually not simply a linear mixture of independent components.

This article first extends ICA by introducing the kernel method, so that ICA can separate nonlinearly mixed observed data. This article further extends ICA to the supervised situation by introducing category information into the process of solving independent components, so that the solved independent components are more beneficial for anomaly detection. Based on these two improvements, this article proposes a feature extraction algorithm based on supervised independent component analysis with kernel (termed SKICA).

SKICA contains two key steps, which are detailed in Sections 4.1 and 4.2 respectively.

4.1 Introducing kernel method into ICA – whitening data using KPCA

For observed data with nonlinear structure in the input space Rⁿ, the purpose of introducing the kernel method into ICA is to map the observed data from Rⁿ to a Hilbert space $H$ (the output space or feature space) by the following nonlinear map Φ, so that the mapped data have linear structure. $\begin{matrix} Φ : R^{n} \to H, \\ X \mapsto X = Φ (X), \end{matrix}$ (16)

Here, $k (X, X^{'}) = (Φ (X) \cdot Φ (X^{'})) = Φ {(X)}^{T} Φ (X^{'})$ (17)

is the kernel function associated with the map Φ.
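A small numerical check may help to see what Equation 17 states. The sketch below uses the homogeneous polynomial kernel of degree 2 on R², for which an explicit map Φ is known; this kernel is chosen only for illustration and is not claimed to be the kernel used in the experiments. The names `phi` and `k_poly2` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def k_poly2(x, y):
    """Kernel evaluated directly in the input space: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
lhs = k_poly2(x, y)              # kernel value in the input space
rhs = float(phi(x) @ phi(y))     # inner product in the feature space
```

Both quantities equal 1 for these two vectors, so the inner product in the feature space is obtained without ever forming Φ explicitly, which is what makes the kernel method practical when $H$ is high-dimensional.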

As stated in Section 3, an important preprocessing step in ICA is whitening, which can be implemented by PCA. Note that, performing PCA based on the mapped data in feature space $H$ can be equivalently implemented in input space Rⁿ by kernel method, i.e., performing KPCA based on the observed data. The adopted KPCA ([25]) is detailed below.

Given l samples, x ₁, x ₂, …, x _l, in input space Rⁿ. First assume the mapped samples in feature space $H$ are centered (the centering problem in $H$ will be discussed later), i.e., $\sum_{i = 1}^{l} Φ (X_{i}) = 0 .$ (18)

The covariance operator of these mapped samples in the feature space $H$ can be computed by: $S^{Φ} = \frac{1}{l} \sum_{i = 1}^{l} Φ (X_{i}) Φ (X_{i})^{T} .$ (19)

If $H$ is a finite-dimensional feature space, S^Φ is generally referred to as covariance matrix. S^Φ is a positive operator, therefore non-zero eigenvalues of S^Φ are all positive.

Let Q = [Φ ( x ₁) Φ ( x ₂) … Φ ( x _l)], then S^Φ can be expressed by S^Φ = (1/l) QQ^T. Construct a Gram matrix R = Q^TQ. R is an l-by-l matrix, and its elements r_ij can be determined by the kernel function k ( x , x′) associated with Φ, i.e., $r_{ij} = Φ {(X_{i})}^{T} Φ (X_{j}) = (Φ (X_{i}) \cdot Φ (X_{j})) = k (X_{i}, X_{j}) .$ (20)

Calculate the s largest positive eigenvalues of R, λ₁ ≥ λ₂ ≥ … ≥ λ_s, and the associated eigenvectors, v₁, v₂, …, v_s. Then the s largest positive eigenvalues of S^Φ are λ₁/l, λ₂/l, …, λ_s/l, and the associated eigenvectors β₁, β₂, …, β_s can be expressed by: $β_{i} = \frac{1}{\sqrt{λ_{i}}} Q v_{i}, i = 1, 2, \dots, s .$ (21)

Let V = [v₁ v₂ … v_s], Λ = diag (λ₁, λ₂, …, λ_s), and B = [β₁ β₂ … β_s] = $Q V Λ^{- 1 / 2}$, then $B^{T} S^{Φ} B = diag (\frac{λ_{1}}{l}, \frac{λ_{2}}{l}, \dots, \frac{λ_{s}}{l}) = \frac{1}{l} Λ .$ (22)

Let $P = B {(\frac{1}{l} Λ)}^{- 1 / 2} = \sqrt{l} QV Λ^{- 1},$ (23)

then P^TS^ΦP = I.
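The identity $P^{T} S^{Φ} P = I$ follows in two short steps from the definitions above. The block below spells them out, using only R = Q^TQ, R v_i = λ_i v_i with orthonormal eigenvectors v_i, and Equations 21 to 23; it is a verification sketch added for the reader and is not taken from the original derivation.

```latex
% Step 1: each \beta_i is a unit eigenvector of S^{\Phi} with eigenvalue \lambda_i / l.
S^{\Phi} \beta_i
  = \frac{1}{l} Q Q^{T} \frac{1}{\sqrt{\lambda_i}} Q v_i
  = \frac{1}{l \sqrt{\lambda_i}} Q R v_i
  = \frac{\lambda_i}{l} \beta_i ,
\qquad
\beta_i^{T} \beta_j
  = \frac{v_i^{T} R v_j}{\sqrt{\lambda_i \lambda_j}}
  = \delta_{ij} .

% Step 2: rescaling B by (\Lambda / l)^{-1/2} yields unit covariance.
P^{T} S^{\Phi} P
  = \left( \tfrac{1}{l} \Lambda \right)^{-1/2} B^{T} S^{\Phi} B \left( \tfrac{1}{l} \Lambda \right)^{-1/2}
  = \left( \tfrac{1}{l} \Lambda \right)^{-1/2} \tfrac{1}{l} \Lambda \left( \tfrac{1}{l} \Lambda \right)^{-1/2}
  = I .
```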

Thus, the whitening matrix P is obtained. The mapped data in feature space $H$ can be whitened by the following transformation: $z = P^{T} Φ (X) .$ (24)

The above transformation can be specifically expressed by: $\begin{matrix} z = \sqrt{l} Λ^{- 1} V^{T} Q^{T} Φ (X) \\ = \sqrt{l} Λ^{- 1} V^{T} [k (x_{1}, x) k (x_{2}, x) \dots k (x_{l}, x)]^{T} \\ = \sqrt{l} Λ^{- 1} V^{T} R_{X} \end{matrix} .$ (25)

where $R_{X} = {[k (X_{1}, X) k (X_{2}, X) \dots k (X_{l}, X)]}^{T} .$ (26)

Now consider the problem of centering data. It is easy to center data in the input space Rⁿ, but it is difficult to do so in the feature space $H$ because the mean of the mapped data in $H$ cannot be explicitly computed. Fortunately, a slight modification of the above process achieves this. Let C _M = (1/l) _l×l denote an l-by-l matrix whose elements are all 1/l, and let C ₁ = (1/l) _l×1 denote the corresponding l-by-1 matrix. Center the Gram matrices R and R_X by: $\tilde{R} = R - C_{M} R - R C_{M} + C_{M} R C_{M},$ (27)

${\tilde{R}}_{X} = R_{X} - C_{M} R_{X} - R C_{1} + C_{M} R C_{1} .$ (28)

Replace R with $\tilde{R}$, and compute its s largest positive eigenvalues λ₁, λ₂, …, λ_s and the associated eigenvectors v₁, v₂, …, v_s. Let V = [v₁ v₂ … v_s], Λ = diag (λ₁, λ₂, …, λ_s), then the whitening transformation can be expressed by: $z = \sqrt{l} Λ^{- 1} V^{T} {\tilde{R}}_{X} .$ (29)
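The KPCA whitening of Equations 27 and 29 can be sketched as follows. This is a minimal illustration under stated assumptions: a Gaussian (RBF) kernel with a hypothetical width parameter `gamma` stands in for the kernel function, samples are stored as columns, and the transformation is applied to the training samples themselves, for which the centered vector of Equation 28 coincides with the corresponding column of the centered Gram matrix. The function names are hypothetical.

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2); samples are columns."""
    sq = (A ** 2).sum(0)[:, None] + (B ** 2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_whiten(X, s, gamma=1.0):
    """Whiten the l training samples in feature space; returns an s-by-l matrix Z."""
    l = X.shape[1]
    R = rbf_gram(X, X, gamma)                           # Equation 20
    C_M = np.full((l, l), 1.0 / l)
    R_c = R - C_M @ R - R @ C_M + C_M @ R @ C_M         # Equation 27
    lam, V = np.linalg.eigh(R_c)
    idx = np.argsort(lam)[::-1][:s]                     # s largest eigenvalues
    lam, V = lam[idx], V[:, idx]
    # Equation 29 for all training samples at once: column j of R_c is the
    # centered kernel vector (Equation 28) of training sample x_j.
    return np.sqrt(l) * np.diag(1.0 / lam) @ V.T @ R_c

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Z = kpca_whiten(X, s=3, gamma=0.5)
cov_Z = Z @ Z.T / Z.shape[1]                            # should be close to I
```

The whitened features have zero mean and identity covariance over the training samples, even though the mapped data Φ(x) are never formed explicitly.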

4.2 Extending ICA to supervised situation

In order to make full use of the category information of samples, ICA needs to be extended to supervised situation. This article introduces the following ideas in LDA to extend ICA.

Assume that the l samples can be divided into c classes, where $H_{i} = \{X_{1}^{(i)}, \dots, X_{l_{i}}^{(i)}\} \subset R^{n}$ is the subset of samples in the i-th class, l_i is the number of samples in the i-th class, i = 1, …, c, and l = l₁ + l₂ + … + l_c. The within-cluster scatter matrix ($S_{W}$) is defined as: $S_{W} = \sum_{i = 1}^{c} S_{i} = \sum_{i = 1}^{c} \sum_{X \in H_{i}} (X - m_{i}) (X - m_{i})^{T},$ (30)

where $\begin{matrix} S_{i} = \sum_{X \in H_{i}} (X - m_{i}) (X - m_{i})^{T} \\ m_{i} = \frac{1}{l_{i}} \sum_{X \in H_{i}} X \end{matrix}, i = 1, 2, \dots, c$ (31)

are the within-cluster scatter matrix and the mean of the i-th class respectively.
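Equations 30 and 31 translate directly into code. The following minimal sketch assumes that samples are stored as columns and that class membership is given by an integer label array; the function name `within_scatter` and the four toy samples are hypothetical.

```python
import numpy as np

def within_scatter(X, labels):
    """S_W of Equation 30 for an n-by-l matrix X (samples as columns)."""
    n = X.shape[0]
    S_W = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[:, labels == c]                     # samples of one class
        D = Xc - Xc.mean(axis=1, keepdims=True)    # x - m_i (Equation 31)
        S_W += D @ D.T                             # add S_i
    return S_W

# Two classes with two samples each; all within-class spread is along metric 1.
X = np.array([[0.0, 2.0, 10.0, 12.0],
              [0.0, 0.0, 5.0, 5.0]])
labels = np.array([0, 0, 1, 1])
S_W = within_scatter(X, labels)   # [[4, 0], [0, 0]]
```

Each class contributes deviations of ±1 along the first metric only, so S_W has the single non-zero entry 4, regardless of how far apart the two class means are.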

$S_{W}$ represents the aggregation degree of these l samples. The objective of ICA is to solve for vectors w which maximize the negentropy of Y. This article introduces $w^{T} S_{W} w$ into the optimization problem of ICA (Equation 12) as a constraint condition, thus obtaining the following optimization problem: $\begin{matrix} \max J (w) = [E \{G (w_{i}^{T} z)\} - E \{G (υ)\}]^{2} \\ \text{s.t.} \quad E \{(w_{k}^{T} z) (w_{j}^{T} z)\} = δ_{kj} \\ w_{i}^{T} S_{W} w_{i} - ɛ \leq 0 \\ i = 1, 2, \dots, s \end{matrix}$ (32)

where ɛ is a predefined threshold. Solving Equation 32 by FastICA yields the following iterative formula: $\begin{matrix} w \leftarrow E \{z g (w^{T} z)\} - E \{g^{'} (w^{T} z)\} w \\ + r_{p} (S_{W} w) (S_{W} w)^{T} w \end{matrix},$ (33)

where r_p is a positive scalar parameter that controls the degree of introduced within-cluster information. If r_p = 0, Equation 33 reduces to the traditional FastICA iteration (Equation 13).
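One update of Equation 33 can be sketched as below. The sketch makes two assumptions that are not stated in the text: g is fixed to g₁ with a₁ = 1, and the scatter matrix passed in has the same dimension as w (for example, because it was computed on the whitened samples). The function name `supervised_update` and the toy inputs are hypothetical.

```python
import numpy as np

def supervised_update(w, Z, S_W, r_p):
    """One update of Equation 33 with g = tanh; Z holds whitened samples as columns.

    Assumption: S_W is a square matrix of the same dimension as w.
    """
    y = w @ Z                                           # w^T z for every sample
    ica_term = (Z * np.tanh(y)).mean(axis=1) \
        - (1.0 - np.tanh(y) ** 2).mean() * w            # Equation 13
    Sw = S_W @ w
    return ica_term + r_p * Sw * (Sw @ w)               # + r_p (S_W w)(S_W w)^T w

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 500))
S_W = np.diag([1.0, 2.0, 3.0])          # toy scatter matrix (hypothetical)
w = np.array([1.0, 0.0, 0.0])
w_ica = supervised_update(w, Z, S_W, r_p=0.0)   # plain FastICA update
w_sup = supervised_update(w, Z, S_W, r_p=0.5)   # with within-cluster term
```

With r_p = 0 the function reproduces the unsupervised update, and the difference between the two results is exactly r_p (S_W w)(S_W w)^T w, here the vector (0.5, 0, 0).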

4.3 SKICA

The first key step in SKICA is to whiten the sample data in Rⁿ by using KPCA. This key step maps the sample data into feature space $H$ at the same time. The second key step in SKICA is to conduct supervised ICA in feature space $H$ . SKICA implements ICA still through maximizing non-Gaussianity. Its principle is intuitive and easy to implement.

The concrete steps of SKICA are listed as follows.

Algorithm 2. The feature extraction algorithm based on SKICA.

Input: l labeled n-dimensional samples, x ₁, x ₂, … x _l; these l samples can be classified into c classes; the numbers of samples in each class are l₁, l₂,..., l_c, l₁ + l₂ + … + l_c = l; the number of extracted features, s.

Output: The transformation matrix, W _s×n; the sample matrix after feature extraction, Y _s×l.

Step 1. Organize the samples into an n-by-l data matrix X _n×l = [ x ₁ x ₂ … x _l]; Center X _n×l to make its mean zero; The centered matrix is also denoted as X _n×l.

Step 2. Compute the within-cluster scatter matrix $S_{W}$ of X _n×l by Equation 30.

Step 3. Choose the kernel function K (x, x′).

Step 4. Whiten the centered X _n×l to give a new white data matrix in feature space $H$ , which is denoted as Z_n×l = [z₁ z₂ … z_l].

Step 5. Let the number of estimated independent components equal to s.

Step 6. Choose a random initial matrix W = [ w ₁ w ₂ … w _s] ^T, where ∥ w _i ∥ = 1. Orthogonalize W by Equation 15.

Step 7. For i = 1 to s, update w _i by: $\begin{matrix} w_{i} \leftarrow E \{z g (w_{i}^{T} z)\} - E \{g^{'} (w_{i}^{T} z)\} w_{i} \\ + r_{p} (S_{W} w_{i}) (S_{W} w_{i})^{T} w_{i} \end{matrix},$

where g is defined, e.g., as in Equations 9–11.

Step 8. Symmetrically orthogonalize the matrix W by $W = {({WW}^{T})}^{- 1 / 2} W$.

Step 9. If $(1 - | w_{i}^{T} w_{i}^{old} |) < δ$ for every i, where $w_{i}^{old}$ is the value of w _i before the current iteration and δ is a small predefined threshold which indicates the end of the iteration, then the algorithm has converged; otherwise, go to Step 7 for the next iteration.

Step 10. Transform the data matrix X _n×l from n-dimensional space to s-dimensional space by the transformation Y _s×l = W _s×n × X _n×l, and obtain Y _s×l.

5 Experiments and analyses

5.1 Datasets

This article conducts experiments on the following three kinds of datasets to evaluate the effectiveness of SKICA.

(1) Artificial datasets

Artificial datasets are widely used for algorithm testing and evaluation, because they make it possible to generate random datasets which follow a required probability distribution. Narasimhamurthy and Kuncheva present a framework for generating data to simulate changing environments ([29]). They provide several artificial datasets ([30]). All of these datasets have a certain degree of nonlinear structure and therefore pose great challenges for the linear feature extraction algorithms involved in the experiments. The scatter diagrams of these datasets are shown in Fig. 1(a)-(h).

Fig. 1. Eight artificial datasets.

(2) Cloud datasets

As stated in Section 1, the data collected from production systems usually do not follow a Gaussian distribution. The health states (normal or abnormal) of virtual machines (VMs) in production Cloud platforms can be characterized by performance metrics. This article first constructs an institute-wide Cloud platform and collects a set of 53 performance metrics for each VM. These performance metrics can be classified into five categories: computation, storage, disk I/O, process, and network.

This article examines the probability distribution of the collected performance metrics. First, all metrics of a randomly selected VM are sampled every 5 seconds for 2 hours, so that 1440 sampled values are obtained for each metric. Then eight performance metrics are arbitrarily chosen, and their histograms are plotted, as shown in Fig. 2. These histograms confirm that the performance metric data of VMs in a Cloud environment usually do not follow a Gaussian distribution.

Fig. 2. The histograms of eight arbitrarily chosen performance metrics.

In order to detect abnormal states of VMs, an anomaly detection algorithm, e.g., Support Vector Machine (SVM, [31, 32]), trains a detection model based on a training sample set represented by the extracted features. The detection model then determines a new sample as normal or abnormal.

(3) KDD Cup dataset

The KDD Cup 1999 dataset ([33]) derives from a dataset which was simulated and collected in a military network environment ([34]). Therefore, the KDD Cup dataset is a real dataset. It was released by the fifth International Conference on Knowledge Discovery and Data Mining (KDD). Each sample characterizes a connection record between a source host and a destination host during a complete session. Each sample contains 41 features, excluding the label of the attack category. The KDD Cup dataset contains a training dataset (4,898,431 samples in total), which was collected from 7 weeks of network traffic, and a test dataset (2,984,154 samples in total), which was collected from 2 weeks of network traffic. The KDD Cup dataset contains a wide variety of intrusions including Probe, DOS, U2R, and R2L. Therefore, it is widely used for the testing and evaluation of algorithms in intrusion detection, anomaly detection, data mining, etc.

5.2 Experimental methods

This article compares SKICA with three other supervised feature extraction algorithms: traditional LDA ([7, 8]), LDA with kernel (KDA, [27]), and supervised ICA with category information (SICA, [17]). The comparison covers the following two aspects.

(1) Discriminative ability test

As stated in Section 1, one purpose of feature extraction in anomaly detection is to discriminate different classes of samples. This ability is usually referred to as discriminative ability. In order to quantitatively measure discriminative ability of the feature extraction algorithms involved in experiments, this article defines the following three square distances.

d ( X ): total average square distance;

d_W (X): within-cluster average square distance;

d_B (X): between-cluster average square distance.

For the sample matrix $X_{n \times l}$ before feature extraction, let $X_{k}^{(i)} \in H_{i}$ , $X_{t}^{(j)} \in H_{j}$ be samples in the i-th and j-th class respectively, $δ (X_{k}^{(i)}, X_{t}^{(j)})$ be the square distance (i.e., the squared Euclidean distance) between these two samples, p_i, p_j be the prior probabilities of these two classes, c be the number of classes, and l_i, l_j be the numbers of samples in these two classes; then d ( X ) is defined as:

$d (X) = \frac{1}{2} \sum_{i = 1}^{c} p_{i} \sum_{j = 1}^{c} p_{j} \frac{1}{l_{i} l_{j}} \sum_{k = 1}^{l_{i}} \sum_{t = 1}^{l_{j}} δ (X_{k}^{(i)}, X_{t}^{(j)}) .$ (34)

d_W (X) is defined as: $d_{W} (X) = \frac{1}{2} \sum_{i = 1}^{c} p_{i} \frac{1}{l_{i}^{2}} \sum_{k = 1}^{l_{i}} \sum_{t = 1}^{l_{i}} δ (X_{k}^{(i)}, X_{t}^{(i)}) .$ (35)

Finally, d_B (X) is computed as: $d_{B} (X) = d (X) - d_{W} (X) .$ (36)
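
As a worked illustration, Eqs. (34)-(36) can be related to class means and variances, assuming that δ is the squared Euclidean distance and that the priors satisfy $\sum_{i = 1}^{c} p_{i} = 1$. The symbols $m_{i}$ (mean of class $H_{i}$), $m = \sum_{i = 1}^{c} p_{i} m_{i}$, and $\sigma_{i}^{2} = \frac{1}{l_{i}} \sum_{k = 1}^{l_{i}} \| X_{k}^{(i)} - m_{i} \|^{2}$ are introduced here for illustration only. Under these assumptions, the average square distance between two classes decomposes as

$\frac{1}{l_{i} l_{j}} \sum_{k = 1}^{l_{i}} \sum_{t = 1}^{l_{j}} δ (X_{k}^{(i)}, X_{t}^{(j)}) = \sigma_{i}^{2} + \sigma_{j}^{2} + \| m_{i} - m_{j} \|^{2} ,$

so that

$d_{W} (X) = \sum_{i = 1}^{c} p_{i} \sigma_{i}^{2} , \quad d_{B} (X) = \frac{1}{2} \sum_{i = 1}^{c} \sum_{j = 1}^{c} p_{i} p_{j} \| m_{i} - m_{j} \|^{2} = \sum_{i = 1}^{c} p_{i} \| m_{i} - m \|^{2} .$

In this reading, d_W (X) is the prior-weighted spread of the samples around their own class means, and d_B (X) is the prior-weighted spread of the class means around the overall mean.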

For the sample matrix $Y_{s \times l}$ after feature extraction, the square distances d ( Y ), d_W ( Y ), and d_B ( Y ) can be computed in the same way.

For a given feature extraction algorithm and dataset with a specified number of extracted features, a smaller d_W ( Y ) value means that the samples of each class cluster more tightly after feature extraction, and a larger d_B ( Y ) value means that samples belonging to different classes lie farther apart. Both indicate better discriminative ability of the extracted features, and hence better performance of the feature extraction algorithm.
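
The three distances of Eqs. (34)-(36) can be computed directly from their definitions. The following is a minimal sketch, not the authors' implementation. It assumes the squared Euclidean distance for δ, stores samples as rows (the article stores them as columns), and falls back to class frequencies l_i / l when no priors are given. The function name `average_square_distances` is hypothetical.

```python
import numpy as np

def average_square_distances(samples, labels, priors=None):
    """Return (d, d_W, d_B) following Eqs. (34)-(36).

    samples: array of shape (l, n), one row per sample (assumed layout).
    labels:  length-l sequence of class labels.
    priors:  optional dict label -> p_i; defaults to class frequencies
             l_i / l, which is an assumption of this sketch.
    """
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    groups = [samples[labels == c] for c in classes]
    if priors is None:
        p = np.array([len(g) / len(samples) for g in groups])
    else:
        p = np.array([priors[c] for c in classes])

    def mean_pairwise_sq_dist(a, b):
        # Mean of ||a_k - b_t||^2 over all pairs (k, t), i.e. the
        # 1 / (l_i l_j) double sum inside Eq. (34).
        diff = a[:, None, :] - b[None, :, :]
        return np.mean(np.sum(diff ** 2, axis=2))

    d_total = 0.0
    d_within = 0.0
    for i, gi in enumerate(groups):
        d_within += 0.5 * p[i] * mean_pairwise_sq_dist(gi, gi)   # Eq. (35)
        for j, gj in enumerate(groups):
            d_total += 0.5 * p[i] * p[j] * mean_pairwise_sq_dist(gi, gj)  # Eq. (34)
    return d_total, d_within, d_total - d_within                 # Eq. (36)

# Toy example (made-up data): two well-separated one-dimensional classes.
toy = np.array([[0.0], [2.0], [10.0], [12.0]])
d, d_w, d_b = average_square_distances(toy, [0, 0, 1, 1])
print(d, d_w, d_b)
```

The pairwise computation needs memory proportional to l_i × l_j × n for each pair of classes, so for large datasets it would have to be evaluated in chunks.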

(2) Detection accuracy test

Another purpose of feature extraction in anomaly detection is to extract the most important and effective features for representing data. Therefore, as an important preprocessing step, feature extraction is expected to enhance the accuracy of anomaly detection algorithms. This article combines each of the involved feature extraction algorithms with the same anomaly detection algorithm (standard SVM, i.e., C-SVM, [31, 32]) and tests the detection accuracy of the combined anomaly detection mechanisms on the Cloud dataset. This article adopts the following two accuracy measures. (F_P, False Positive; F_N, False Negative; T_P, True Positive; T_N, True Negative)

Sensitivity: the proportion of True Positive (i.e., correctly detected abnormal states of VMs) to the number of actual abnormal states. $Sensitivity = T_{P} / (T_{P} + F_{N})$ (37)

Specificity: the proportion of True Negative (i.e., correctly detected normal states of VMs) to the number of actual normal states. $Specificity = T_{N} / (F_{P} + T_{N})$ (38)

5.3 Experimental results and analyses

This article conducts the following two sets of experiments to evaluate the effectiveness of the proposed SKICA.

(1) Discriminative ability experiments

The results of the four supervised feature extraction algorithms (i.e., LDA, KDA, SICA, and the proposed SKICA) on 8 artificial datasets are listed in Table 1. For each dataset, only one feature is extracted. Each cell in the last four columns of Table 1 contains two values: the first is d_W ( Y ) and the second is d_B ( Y ).

Table 1
Experimental results of 4 supervised feature extraction algorithms on 8 artificial datasets

No. of datasets	# of metrics	# of features	LDA (d_W(Y), d_B(Y))	KDA (d_W(Y), d_B(Y))	SICA (d_W(Y), d_B(Y))	SKICA (d_W(Y), d_B(Y))
1	12	1	(0.6651, 0.9843)	(0.6517, 0.9895)	(0.6373, 1.0044)	(0.5817, 1.1413)
2	12	1	(1.4818, 0.1907)	(1.4765, 0.1990)	(1.4173, 0.2135)	(1.2831, 0.4015)
3	12	1	(1.4420, 0.1549)	(1.4376, 0.2607)	(1.4116, 0.3824)	(1.1995, 0.6207)
4	2	1	(0.7899, 0.3613)	(0.7510, 0.3911)	(0.7184, 0.4383)	(0.6001, 0.6316)
5	2	1	(0.4147, 0.8250)	(0.3819, 0.8317)	(0.3607, 0.8952)	(0.2997, 0.9913)
6	2	1	(0.9974, 0.0113)	(0.8896, 0.1025)	(0.7367, 0.2743)	(0.6816, 0.4125)
7	2	1	(0.4958, 0.9375)	(0.4027, 0.9587)	(0.3262, 0.9710)	(0.2987, 0.9899)
8	2	1	(0.9841, 0.0757)	(0.9150, 0.1639)	(0.8328, 0.2395)	(0.7101, 0.3861)

The experimental results show that LDA has the poorest discriminative ability in terms of d_W ( Y ) and d_B ( Y ) values, since LDA is based on the Gaussian distribution assumption. KDA improves LDA by introducing the kernel method and is thus applicable to non-Gaussian datasets; accordingly, KDA outperforms LDA. SICA extracts the independent components as features. On the first five datasets, which follow Gaussian-like distributions, SICA slightly outperforms LDA and KDA; on the last three non-Gaussian datasets, SICA evidently outperforms LDA and slightly outperforms KDA.

Most importantly, the proposed SKICA outperforms the other three algorithms on all 8 artificial datasets. Specifically, the d_W ( Y ) value achieved by SKICA is the smallest in every experiment, which means that the samples of each class, represented by the extracted features, cluster most tightly; the d_B ( Y ) value is the largest in every experiment, which means that the classes are separated the farthest. Therefore, SKICA achieves the best discriminative ability.

Table 2 lists the experimental results of these 4 supervised feature extraction algorithms on 2 Cloud datasets. The number of extracted features is specified as 16. Therefore, the dimensionality reduction rate of each of the four algorithms is (53 - 16)/53 = 69.81%. The results show that the proposed SKICA still evidently outperforms the other three algorithms on the non-Gaussian Cloud datasets. Specifically, the d_W ( Y ) values of SKICA are the smallest in each experiment, while its d_B ( Y ) values are the largest.

Table 2

Experimental results of 4 supervised feature extraction algorithms on 2 Cloud datasets

No. of datasets	# of metrics	# of features	LDA (d_W(Y), d_B(Y))	KDA (d_W(Y), d_B(Y))	SICA (d_W(Y), d_B(Y))	SKICA (d_W(Y), d_B(Y))
1	53	16	(26.0131, 7.8965)	(25.2519, 8.2427)	(25.5879, 8.3971)	(24.0137, 8.5445)
2	53	16	(28.3883, 8.2812)	(27.8564, 8.5173)	(27.6821, 8.5966)	(27.1018, 8.8239)

Table 3 lists the experimental results of these 4 supervised algorithms on 2 KDD Cup datasets. The number of extracted features is specified as 12. Therefore, the dimensionality reduction rate of each of the four algorithms is (41 - 12)/41 = 70.73%. The results show that the proposed SKICA still outperforms the other three supervised algorithms on the KDD Cup datasets in terms of d_W ( Y ) and d_B ( Y ) values.

Table 3

Experimental results of 4 supervised feature extraction algorithms on 2 KDD Cup datasets

No. of datasets	# of metrics	# of features	LDA (d_W(Y), d_B(Y))	KDA (d_W(Y), d_B(Y))	SICA (d_W(Y), d_B(Y))	SKICA (d_W(Y), d_B(Y))
1	41	12	(19.9781, 5.6319)	(19.3826, 5.7623)	(19.2671, 5.8740)	(18.9713, 6.1517)
2	41	12	(22.0159, 6.2908)	(21.8830, 6.4127)	(21.6194, 6.4933)	(21.3441, 6.5625)

(2) Detection accuracy experiments

In this set of experiments, each of the four feature extraction algorithms extracts 16 features from the 53 performance metrics of the Cloud datasets. First, a training sample set containing 2000 samples is constructed. The four combined anomaly detection mechanisms (i.e., LDA plus C-SVM, KDA plus C-SVM, SICA plus C-SVM, and SKICA plus C-SVM) each train a detection model on this training sample set. A test sample set containing another 2000 samples is constructed at the same time. The accuracy measures of these detection models on this test sample set are calculated and listed in Table 4.

Usually, there exists much correlation and redundancy among the original performance metrics. Some metrics may even be mixed with measurement noise since they are collected from production systems in real time. Feature extraction algorithms can reduce or even eliminate the correlation and redundancy among features. The experimental results in Table 4 show that SKICA plus C-SVM achieves the highest performance measures in terms of both Sensitivity and Specificity.

Table 4

Performance measure results of four anomaly detection mechanisms on Cloud datasets

Algorithm	T_P	F_P	F_N	T_N	Sensitivity	Specificity
LDA + C-SVM	942	103	78	877	0.924	0.895
KDA + C-SVM	959	82	61	898	0.940	0.916
SICA + C-SVM	964	75	56	905	0.945	0.923
SKICA + C-SVM	973	68	47	912	0.954	0.931
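
The Sensitivity and Specificity columns of Table 4 follow from the counts in the same table via Eqs. (37) and (38). The short sketch below recomputes them as a consistency check. The function names are hypothetical, and the counts are copied from Table 4.

```python
def sensitivity(tp, fn):
    """Eq. (37): T_P / (T_P + F_N)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Eq. (38): T_N / (F_P + T_N)."""
    return tn / (fp + tn)

# Counts copied from Table 4, in the column order (T_P, F_P, F_N, T_N).
table4 = {
    "LDA + C-SVM": (942, 103, 78, 877),
    "KDA + C-SVM": (959, 82, 61, 898),
    "SICA + C-SVM": (964, 75, 56, 905),
    "SKICA + C-SVM": (973, 68, 47, 912),
}

for name, (tp, fp, fn, tn) in table4.items():
    print(f"{name}: sensitivity={sensitivity(tp, fn):.3f}, "
          f"specificity={specificity(tn, fp):.3f}, total={tp + fp + fn + tn}")
```

Each row sums to the 2000 test samples (1020 actual abnormal states and 980 actual normal states), and the recomputed values agree with the tabulated ones to three decimal places.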

In sum, SKICA improves ICA in two respects, i.e., handling nonlinear mixtures of observed data and introducing category information. From the above two sets of experiments, it is concluded that the proposed SKICA not only enlarges the distance among different classes of samples, but also extracts the most effective features for anomaly detection. Therefore, SKICA is more applicable than the other three algorithms to anomaly detection in the supervised setting.

6 Conclusion and future work

Feature extraction techniques for anomaly detection face challenges including non-Gaussian data and nonlinear feature extraction. To address these challenges, this article proposes a new feature extraction algorithm termed SKICA. SKICA inherits the advantage of ICA, i.e., the solved components or features are mutually independent. Moreover, SKICA implements nonlinear feature extraction and enlarges the distance among different classes of samples, which benefits subsequent anomaly detection.

In the experiments, this article adopts the standard SVM (i.e., C-SVM) as the anomaly detection algorithm. In future work, we will consider more powerful anomaly detection algorithms for different application domains, and will test the effectiveness of the resulting detection mechanisms (i.e., SKICA combined with these detection algorithms).

Acknowledgments

The authors are grateful to the editor and anonymous reviewers for their valuable comments on this article.

The work of this article is supported by National Natural Science Foundation of China (Grant No. 61272399 and No. 61572090), Chongqing Research Program of Basic Research and Frontier Technology (Grant No. cstc2015jcyjBX0014, and No. cstc2016jcyjA0304), and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (No. KJ1600521).

References

1. Chandola, Banerjee and Kumar, Anomaly Detection: A Survey, ACM Computing Surveys 41(3) (2009), Article 15.

2. Davis J.J. and Clark A.J., Data preprocessing for anomaly based network intrusion detection: A review, Computers & Security 30(6-7) (2011), 353–375.

3. Hyvarinen, Karhunen and Oja, Independent Component Analysis, John-Wiley & Sons Inc. Press, 2001.

4. Ding S.F., Zhu, Jia W.K. and Su C.Y., A survey on feature extraction for pattern recognition, Artificial Intelligence Review 37(3) (2012), 169–180.

5. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine 2(11) (1901), 559–572.

6. Pechenizkiy, Puuronen and Tsymbal, The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning, SAC (2006), pp. 553–558.

7. Fisher R.A., The use of multiple measurements in taxonomic problems, Annals of Eugenics 7(2) (1936), 179–188.

8. Ohta and Ozawa, An improvement of incremental recursive fisher linear discriminant for online feature extraction, Electronics and Communications in Japan 96(4) (2013), 29–40.

9. Jing X.Y., Li, Lan et al., Color image canonical correlation analysis for face feature extraction and recognition, Signal Processing 91(8) (2011), 2132–2140.

10. Stuhlsatz, Lippel and Zielke, Feature extraction with deep neural networks by a generalized discriminant analysis, IEEE Transactions on Neural Networks and Learning Systems 23(4) (2012), 596–608.

11. Liu Y.H., Aickelin, Feyereisl and Durrant L.G., Wavelet feature extraction and genetic algorithm for biomarker detection in colorectal cancer data, Knowledge-Based Systems 37 (2013), 502–514.

12. Herault and Jutten, Space or Time Adaptive Signal Processing by Neural Network Models, AIP Conference Proceedings 15(1) (on Neural Networks for Computing), (1986), pp. 206–211.

13. Comon, Independent Component Analysis, a New Concept? Signal Processing 36(3) (1994), 287–314.

14. Lan Z.L., Zheng Z.M. and Li Y.W., Toward automated anomaly identification in large-scale systems, IEEE Transactions on Parallel and Distributed Systems 21(2) (2010), 174–187.

15. Akaho, Conditionally independent component analysis for supervised feature extraction, Neurocomputing 49 (2002), 139–150.

16. Kwak and Choi C.-H., Feature extraction based on ICA for binary classification problems, IEEE Transactions on Knowledge and Data Engineering 15(6) (2003), 1374–1388.

17. Takabatake, Kotani and Ozawa, Feature extraction by supervised independent component analysis based on category information, Electrical Engineering in Japan 161(2) (2007), 542–547.

18. Wang X.M., Supervised Manifold Learning and Kernel Independent Component Analysis Applied to the Face Image Recognition, In Proceedings of the 15th Conference on Intelligent Computation Technology and Automation, 2012, pp. 600–603.

19. Yamazaki and Fels, Local Image Descriptors Using Supervised Kernel ICA, In Proceedings of 3rd Pacific Rim Symposium on Advances in Image and Video Technology, 2009, pp. 94–105.

20. Tao M.L., Zhou, Liu and Zhang Z., Tensorial independent component analysis-based feature extraction for polarimetric SAR data classification, IEEE Transactions on Geoscience and Remote Sensing 53(5) (2015), 2481–2495.

21. Zhao C.H., Wang Y.L. and Mei, Kernel ICA feature extraction for anomaly detection in hyperspectral imagery, Chinese Journal of Electronics 21(2) (2012), 265–269.

22. Kwak, Kim and Kim, Dimensionality reduction based on ICA for regression problems, Neurocomputing 71 (2008), 2596–2603.

23. Palmieri, Fiore and Castiglione, A distributed approach to network anomaly detection based on independent component analysis, Concurrency and Computation: Practice and Experience 26(5) (2014), 1113–1129.

24. Bartlett M.S., Movellan J.R. and Sejnowski T.J., Face recognition by independent component analysis, IEEE Transactions on Neural Networks 13(6) (2002), 1450–1464.

25. Yang, Gao X.M., Zhang and Yang J.Y., Kernel ICA: An alternative formulation and its application to face recognition, Pattern Recognition 38(10) (2005), 1784–1787.

26. Scholkopf, Smola and Müller K.R., Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10(5) (1998), 1299–1319.

27. Mika, Ratsch, Weston, Scholkopf and Müller K., Fisher Discriminant Analysis with Kernels, In Proceedings of the 1999 IEEE Signal Processing Society Workshop, Neural Networks for Signal Processing IX, 1999, pp. 41–48.

28. Hyvarinen and Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9(7) (1997), 1483–1492.

29. Narasimhamurthy and Kuncheva L.I., A Framework for Generating Data to Simulate Changing Environments, In Proceedings of the 25th IASTED International Multi-conference: Artificial intelligence and applications (A/AP), 2007, pp. 384–389.

30. Kuncheva L.I., Artificial Data Sets, (2007) http://pages.bangor.ac.uk/ mas00a/activities/artificial_data.htm.

31. Cortes and Vapnik V.N., Support-vector networks, Machine Learning 20(3) (1995), 273–279.

32. Deng N.Y., Tian Y.J. and Zhang C.H., Support Vector Machines - Optimization based Theory, Algorithms, and Extensions, CRC Press, 2013.

33. Hettich and Bay S.D., The UCI KDD Archive, 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

34. Lippmann R.P., Fried D.J., Graf et al., Evaluating Intrusion Detection Systems: The DARPA Off-line Intrusion Detection Evaluation, Proceedings of DARPA Information Survivability Conference and Exposition, 2000, pp. 12–26.