Dimensionality reduction of large datasets with explicit feature maps

Abstract

Learning algorithms are often equipped with kernels that enable them to deal with non-linearities in the data, which often improves performance in practice. However, traditional kernel based methods suffer from poor scalability. As a workaround, explicit kernel maps have been proposed in the past. Previously, for the task of streaming Kernel Principal Component Analysis (KPCA), an explicit kernel map has been combined with a matrix sketching technique to obtain a scalable dimensionality reduction (DR) algorithm. This algorithm is limited by two issues, one pertaining to the explicit kernel map and the other to the matrix sketching algorithm. As a solution, two new scalable DR algorithms called ECM-SKPCA and Euler-SKPCA are proposed. The efficacy of the proposed algorithms as scalable DR algorithms is demonstrated via classification tasks on several publicly available datasets. The results indicate that the proposed algorithms produce more effective features than the previous algorithm for the classification task. Furthermore, ECM-SKPCA is demonstrated to be much faster than the other algorithms.

Keywords

Dimensionality reduction, kernel, streaming data, explicit cosine map, classification

1. Introduction

The technique of Principal Component Analysis (PCA) has been used successfully for DR in many applications such as change detection in satellite images [1, 2], analysis of gene expression data [3], stellar spectral classification [4] etc. The simplicity and power of this technique has made it very popular among the data mining practitioners. However, PCA is not capable of capturing non-linearities in the data well, which results in poor performance in certain datasets. To address this issue, Schölkopf et al. [5] proposed a non-linear variant of PCA called KPCA. KPCA has gained popularity due to its ability to discover Principal Components (PCs) from data of complex non-linear nature more effectively than PCA. This algorithm has been deployed in a number of applications like face recognition [6], stock price prediction [7], non-linear process monitoring [8] etc.

Both algorithms discover PCs for the data in a similar manner. The main difference between PCA and KPCA lies in the domain in which the algorithms operate. PCA works on the data in its input space, whereas KPCA works on the data in a very high dimensional space, called the feature space. KPCA proceeds by applying a non-linear function to the given data and then discovering the PCs as in the PCA algorithm. This non-linear map, $\phi$, is an extremely high dimensional map that is never explicitly computed. Instead, the kernel trick is employed, which involves the use of kernel functions. A kernel function computes the dot product between any two data points in the feature space. The advantage of using the high dimensional mapping (also called feature mapping) is that the non-linearities present in the data can be captured more easily, so this method performs better than PCA on datasets with non-linear relationships between their features. A kernel matrix $\mathbf{K}\in\mathbb{R}^{n\times n}$ can thus be constructed by applying the kernel function to all pairs of data points. The eigendecomposition of this matrix produces PCs for the data in the feature space.

KPCA first involves transforming the data in the input space to a higher dimensional space (feature space), $F$ via a kernel function $\chi$ . For the input matrix $\textbf{X}\in\mathbb{R}^{n\times d}$ , its points are $\mathbf{x}_{ij}$ , $i=1,\ldots,n$ , $j=1,\ldots,d$ . The kernel function essentially computes the inner product between the input points in the feature space.

$\displaystyle\chi(\mathbf{x}_{i},\mathbf{x}_{j})=\langle\phi(\mathbf{x}_{i}),\phi(\mathbf{x}_{j})\rangle_{F}$ (1)

where $\phi$ is a high dimensional mapping function that is not explicitly applied. Instead the kernel function $\chi$ is directly applied on all pairs of data points. The kernel matrix $\mathbf{K}$ is thus formed by applying $\chi$ on all pairs of points. After $\mathbf{K}$ is obtained, the mapped points (entries of $\mathbf{K}$ ) are normalized in the feature space. This step is essential because PCA requires that the data be mean normalized. The eigendecomposition of $\tilde{\mathbf{K}}$ can then be done in order to obtain the PCs. Let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be the input matrix for which the PCs are to be discovered.

Obtain the kernel matrix by applying the kernel function on all pairs of data points.

$\displaystyle\mathbf{K}_{ij}=\chi(\mathbf{X}^{i},\mathbf{X}^{j})$

Center the kernel matrix in the feature space so that its mean is zero. Here matrix $\mathbf{N}\in\mathbb{R}^{n\times n}$ contains the entries $\frac{1}{n}$ .

$\displaystyle\tilde{\mathbf{K}}=\mathbf{K}-\mathbf{N}\mathbf{K}-\mathbf{K}\mathbf{N}+\mathbf{N}\mathbf{K}\mathbf{N}$

Eigen decompose the kernel matrix in order to obtain the PCs.

$\displaystyle[\mathbf{U},\mathbf{S},\mathbf{V}]=\text{SVD}(\tilde{\mathbf{K}})$

Re-represent the data by projecting the centered kernel matrix on to the first $r$ PCs, $\mathbf{V}_{r}$ (the first $r$ columns of $\mathbf{V}$).

$\displaystyle\tilde{\mathbf{X}}=\tilde{\mathbf{K}}\mathbf{V}_{r}$
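The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names `gaussian_kernel_matrix` and `kpca`, the choice of a Gaussian kernel and the bandwidth parameter `sigma` are assumptions made for the example. Since $\mathbf{V}$ has $n$ rows, the sketch projects the centered kernel matrix on to the leading $r$ columns of $\mathbf{V}$, which is an assumption about the intended final step.

```python
import numpy as np


def gaussian_kernel_matrix(X, sigma=1.0):
    """Return K with K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def kpca(X, r, sigma=1.0):
    """Batch KPCA: kernel matrix, centering, decomposition, projection."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, sigma)          # step 1: n x n kernel matrix
    N = np.full((n, n), 1.0 / n)                  # every entry equals 1/n
    K_centered = K - N @ K - K @ N + N @ K @ N    # step 2: centering
    _, S, Vt = np.linalg.svd(K_centered)          # step 3: decomposition
    V_r = Vt[:r].T                                # leading r directions
    return K_centered @ V_r, S[:r]                # step 4: projection


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
features, spectrum = kpca(X_demo, r=2)
print(features.shape)  # one r-dimensional row per input point
```

Both the $n\times n$ matrix and its decomposition are formed explicitly here, which is exactly what limits this batch procedure to small $n$.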

Kernelized versions of many machine learning algorithms have emerged in the past, among them KPCA, kernel ridge regression (KRR) and kernel based clustering. Despite their popularity, kernel based algorithms suffer from scalability issues. KPCA requires huge amounts of computational time and space when the data is large (the number of examples is typically of the order of a few thousand). For a matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, the kernel matrix is $\mathbf{K}\in\mathbb{R}^{n\times n}$. In order to store the kernel matrix in memory, $O(n^{2})$ space is required, which becomes prohibitive for large $n$. The computational complexity of performing KPCA depends on two tasks: computing $\mathbf{K}$ and performing the eigendecomposition of $\mathbf{K}$. In the worst case, the time complexity of the KPCA algorithm is $O(n^{3})$. The computation becomes a bottleneck in modern applications where huge amounts of data are encountered.
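As a rough illustration of the $O(n^{2})$ storage cost, the snippet below computes the memory needed to hold a dense kernel matrix. It assumes 8-byte double-precision entries (an assumption, not a figure from this work) and uses the instance count of the COD-RNA dataset from Table 1; the helper name `kernel_matrix_bytes` is hypothetical.

```python
def kernel_matrix_bytes(n, bytes_per_entry=8):
    """Memory needed for a dense n x n kernel matrix, in bytes."""
    return n * n * bytes_per_entry


n_cod_rna = 488565  # number of instances in COD-RNA (Table 1)
size_tb = kernel_matrix_bytes(n_cod_rna) / 1e12
print(f"dense kernel matrix for n={n_cod_rna}: about {size_tb:.2f} TB")
```

Under these assumptions the matrix alone needs roughly 1.9 TB, far more than the main memory of a typical workstation.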

Besides the scalability issue, when data arrives in a streaming manner, it is no longer possible to apply the standard KPCA algorithm. Explicit feature maps such as Random Fourier Features (RFF) have emerged as an effective alternative. They serve as approximations to some well known kernel functions. RFF has a parameter $m$ that determines how close the kernel mapping is to the actual kernel function, which also affects the performance of learning tasks like classification and regression. When the data arrives in a streaming manner, the PCs must be discovered on-the-fly. This cannot be done using the standard PCA algorithm, which requires the whole data to be available at once. An algorithm for performing KPCA in the streaming setting has been proposed by Ghashami et al. [9]. They combined RFF with a matrix sketching algorithm called Frequent Directions (FD). In this work, their algorithm is referred to as RFF-SKPCA.

RFF-SKPCA is affected by the problems of RFF and FD. The RFF kernel map is associated with a time-accuracy trade-off which is influenced by its parameter $m$ , which needs to be large in order to obtain good empirical performance. However, this parameter $m$ is also a key factor in determining the time required to perform the kernel mapping. For any $d$ dimensional input vector, the time required to apply RFF on it is $O(md)$ . When $m$ is large, the kernel mapping time also becomes large. Moreover, the matrix sketching part of RFF-SKPCA uses the FD algorithm, which is a less accurate sketching method as demonstrated by Francis and Raimond [10].

In this work the issues related to the time-accuracy trade-off and matrix sketching part of the previous work are addressed. The proposed algorithms make use of accurate kernel maps Explicit Cosine Map (ECM) and Euler kernel map that were recently proposed by [11] and the $\beta$ FD algorithm [10].

1.1 Contributions

The main contributions of this work are as follows.

•
Fast and accurate SKPCA algorithms called ECM-SKPCA and Euler-SKPCA are proposed.
•
The benefits of using the proposed methods over RFF-SKPCA for scalable DR are demonstrated via classification tasks.

1.2 Scope

In this work, the proposed scalable DR algorithms are used as feature extraction algorithms. For the evaluation of the proposed algorithms, a linear classifier is used. A linear classifier is used in order to demonstrate the influence of the kernel part of the algorithms being compared. Large datasets ( $n$ and/or $d$ are large) with two or more classes have been used for the experimental evaluation.

2. Related work

In this section, we review dimensionality reduction methods that include linear and non-linear (kernel based) methods that are applicable to large datasets.

2.1 Linear methods

Incremental PCA (IPCA) [12, 13] is a popular method to perform dimensionality reduction in a streaming fashion. It estimates the eigenvectors by considering the data points one at a time. The advantage of this approach is that the whole data need never be stored; instead, its features are extracted in a streaming manner. This approach is particularly useful for datasets with a very large number of instances. Sparse PCA (SPCA) [14] is another approach to perform dimensionality reduction. In SPCA, the eigenvectors are constrained to have at most $k$ non-zero components. This parameter $k$ is typically much smaller than $d$, the number of features in the data.

2.2 Non-linear methods

KPCA is a non-linear extension of PCA. It has been demonstrated to extract representative features from complex datasets. However, KPCA suffers from scalability issues due to its $O(n^{3})$ time complexity. Several approaches to handle this issue have been proposed. We focus on scalable explicit kernel maps that approximate a shift-invariant kernel.

2.2.1 RFF

Rahimi and Recht [15] proposed an explicit feature map called RFF. It is based on Bochner’s theorem, which relates a positive definite function to a positive measure. The Gaussian kernel is given in Eq. (2).

$\displaystyle k(\mathbf{x},\mathbf{y})=e^{\frac{-\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}}$ (2)

Its Fourier transform, given in Eq. (3), is a positive measure and hence, by Bochner’s theorem, a probability distribution.

$\displaystyle p(\omega)=\frac{1}{(\sqrt{2\pi})^{d}}e^{\frac{-\|\omega\|^{2}_{2}}{2}}$ (3)

Let $F^{-1}$ denote the inverse Fourier transform operator. Then:

$\displaystyle k(\mathbf{x},\mathbf{y})=e^{\frac{-\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}}=F^{-1}(p(\omega))=\int_{\mathbb{R}^{d}}p(\omega)e^{j\omega^{T}(\mathbf{x}-\mathbf{y})}d\omega\approx\frac{1}{m}\sum_{i=1}^{m}e^{j\omega_{i}^{T}(\mathbf{x}-\mathbf{y})}=z(\mathbf{x})^{T}z(\mathbf{y})\qquad\left(z(\mathbf{x})=\frac{1}{\sqrt{m}}e^{j\omega^{T}\mathbf{x}}\right)$

The approximation step is obtained by Monte Carlo sampling of the integral. An explicit map $z(\cdot)$ of this form approximates any shift invariant kernel $k$ (that is, $k(\mathbf{x},\mathbf{y})=k(\mathbf{x}-\mathbf{y})$).

$\displaystyle z(\mathbf{x})^{T}z(\mathbf{y})\approx k(\mathbf{x},\mathbf{y})$ (4)

The RFF kernel map is defined as follows.

$\displaystyle z(\mathbf{x})=[\cos(\bm{\omega}^{T}\mathbf{x}+\mathbf{b})]$ (5)

Here $p(\omega)$ denotes the Fourier transform of the Gaussian kernel, $\bm{\omega}\in\mathbb{R}^{d}$, $\mathbf{b}\in\mathbb{R}$, $\bm{\omega}\sim p(\bm{\omega})$ and $\mathbf{b}\sim\mathcal{U}(0,2\pi)$. The variance of the approximation (of the Gaussian kernel) can be reduced by sampling $m$ random vectors $\bm{\omega}^{1},\ldots,\bm{\omega}^{m}$ from $p(\omega)$ and stacking the $m$ resulting cosine features vertically to construct the feature vector $\mathbf{z}\in\mathbb{R}^{m}$. The resulting map is given in Eq. (6).

$\displaystyle\mathbf{z}(\mathbf{x})=\sqrt{\frac{2}{m}}\left[\begin{array}{c}\cos(\bm{\omega}^{1,T}\mathbf{x}+\mathbf{b}_{1})\\ \ldots\\ \cos(\bm{\omega}^{m,T}\mathbf{x}+\mathbf{b}_{m})\\ \end{array}\right]$ (6)
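The map of Eq. (6) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the helper name `make_rff` is hypothetical, the frequencies are drawn from a Gaussian with standard deviation $1/\sigma$ so that the map targets the kernel of Eq. (2), and the offsets are drawn uniformly from $[0,2\pi]$, the usual choice for RFF.

```python
import numpy as np


def make_rff(d, m, sigma=1.0, seed=0):
    """Build an RFF map z: R^d -> R^m for the Gaussian kernel with bandwidth sigma."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(m, d))  # rows play the role of omega^1..omega^m
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)      # assumed uniform offsets b_1..b_m

    def z(X):
        X = np.atleast_2d(X)                       # accept one point or a batch of points
        return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

    return z


rng = np.random.default_rng(1)
x = rng.normal(size=5)
y = x + 0.3 * rng.normal(size=5)
z = make_rff(d=5, m=20000, sigma=1.5)
approx = (z(x) @ z(y).T)[0, 0]                     # z(x)^T z(y), as in Eq. (4)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * 1.5 ** 2))
print(approx, exact)                               # the two values should be close for large m
```

Mapping one point costs one $m\times d$ matrix-vector product, which is the $O(md)$ mapping time discussed in the Introduction.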

Variants of RFF: A few variants that improve the speed of the kernel mapping have been proposed in the past. These include Fastfood [16], Orthogonal Random Features (ORF) [17] and Structured ORF (SORF) [17]. Their kernel mapping times are as follows. Fastfood has runtime $O(m\log d)$. ORF takes $O(md)$ time. SORF, a faster version of ORF, has a mapping cost of $O(m\log d)$.

2.2.2 RFF-SKPCA

Ghashami et al. [9] showed that RFF can be used in conjunction with FD in order to perform SKPCA. There are two main tasks in RFF-SKPCA.

1.
Kernel mapping part using RFF
2.
Sketching using FD

Let the original data point be denoted by $\mathbf{x}$ and the mapped point be $\mathbf{z}$ . This algorithm proceeds by mapping each point using RFF map. Then, the PCs of the data points observed so far are computed via the FD algorithm. This algorithm admits the following guarantee (Eq. (7)), where $\mathbf{V}$ (PCs) is the matrix returned by FD.

$\displaystyle\|\mathbf{K}-\hat{\mathbf{K}}\|_{F}\leqslant\|\mathbf{K}-\tilde{\mathbf{K}}\|_{F}+\epsilon k^{1/2}n$ (7)

Here $\mathbf{K}$ is the exact kernel matrix, obtained after applying the Gaussian kernel to the data, $\tilde{\mathbf{K}}=\frac{1}{m}\mathbf{z}(\mathbf{X})^{T}\mathbf{z}(\mathbf{X})$ is the kernel matrix obtained after applying RFF to the data and $\hat{\mathbf{K}}=\mathbf{Z}\mathbf{VV}^{T}\mathbf{Z}^{T}$ is the kernel matrix obtained after applying RFF and FD to the data.

2.2.3 Vanilla Euler kernel map

This explicit kernel map was first proposed by Liwicki et al. [18]. It is basically a map $z:\mathbb{R}^{d}\rightarrow\mathbb{C}^{d}$ defined as follows.

$\displaystyle z(\mathbf{x})=e^{j\alpha\pi\mathbf{x}}$ (8)

Here $j=\sqrt{-1}$ and parameter $\alpha\in[0,2]$ . As $\alpha$ increases the mapping becomes more non-linear. Thus for small values of $\alpha$ , the mapping is linear. It has been successfully employed in computer vision applications. The time required to apply this map on a $d$ -dimensional vector is $O(d)$ . The corresponding kernel function is as follows.

$\displaystyle\chi(\mathbf{x}^{(p)},\mathbf{x}^{(q)})=\sum_{s=1}^{d}\{\cos(\alpha\pi(\mathbf{x}^{(p)}_{s}-\mathbf{x}^{(q)}_{s}))-j\sin(\alpha\pi(\mathbf{x}^{(p)}_{s}-\mathbf{x}^{(q)}_{s}))\}$ (9)
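The relation between the map of Eq. (8) and the kernel of Eq. (9) can be checked numerically. The sketch below is only an illustration: the function names `euler_map` and `euler_kernel` are hypothetical, and the inner product is taken with the first argument conjugated, which is the convention under which Eq. (9) follows from Eq. (8).

```python
import numpy as np


def euler_map(x, alpha):
    """Vanilla Euler map of Eq. (8): elementwise exp(j * alpha * pi * x)."""
    return np.exp(1j * alpha * np.pi * np.asarray(x, dtype=float))


def euler_kernel(xp, xq, alpha):
    """Kernel of Eq. (9), summed over the d attributes."""
    diff = alpha * np.pi * (np.asarray(xp, dtype=float) - np.asarray(xq, dtype=float))
    return np.sum(np.cos(diff) - 1j * np.sin(diff))


xp = np.array([0.1, 0.5, 0.9])
xq = np.array([0.2, 0.4, 0.3])
alpha = 1.1
inner = np.vdot(euler_map(xp, alpha), euler_map(xq, alpha))  # vdot conjugates its first argument
print(np.allclose(inner, euler_kernel(xp, xq, alpha)))
```

The mapped vector has $d$ complex entries, that is, $2d$ real numbers, which is the storage redundancy discussed in Section 2.3.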

2.3 Limitations of previous works

First, the issues associated with the kernel mapping part are described. Lopez-Paz et al. [19] proved that $\mathbb{E}[\|\mathbf{K}-\tilde{\mathbf{K}}\|]\leqslant n\log^{1/2}(n)m^{-1/2}+n\log(n)m^{-1}$, where $\tilde{\mathbf{K}}=\frac{1}{m}\mathbf{z}(\mathbf{X})^{T}\mathbf{z}(\mathbf{X})$. Both terms of the bound shrink only as $m$ grows, so the value of $m$ should be large for the error of kernel approximation to be low. Thus the mapping time would also increase proportionately.

The second issue concerns the effectiveness of FD in computing PCs. Accurately discovering PCs is an important part of SKPCA. This in turn ensures that the features obtained after projecting the data points on to the PCs are good enough for the successful application of some learning task such as classification. The nature of data also plays a significant role in determining how well FD performs. As demonstrated by Francis and Raimond [10], $\beta$ FD is a more accurate sketching algorithm than FD. Its accuracy has been demonstrated using datasets with characteristics such as noise, concept drift etc.

The vanilla Euler kernel map [18] has three issues associated with it. The first issue has to do with its complex valued mapping. Since the output (mapped data points) are complex-valued, they cannot be used straightforwardly in learning algorithms that expect real-valued inputs. The second issue has to do with the redundancy involved in storing the same information in the real and imaginary parts of the complex-valued input. This implies that $2d$ space is required to store a $d$ dimensional input vector’s mapped counterpart.

The third has to do with the lack of involvement of all the attributes in the final mapped output. The vanilla Euler map simply associates a mapped output ( $z(x^{1})$ ) with its corresponding input vector ( $x^{1}$ ) alone. The idea of involving all input attributes, ( $x^{1}_{1},\ldots x^{1}_{d},\ldots x^{n}_{1},\ldots,x^{n}_{d}$ ) in the final mapped output is a popular theme in machine learning algorithms.

The first issue associated with previous kernel based sketching algorithm is the requirement of large $m$ for achieving good kernel approximation accuracy. In this regard, one could resort to other explicit feature maps like vanilla Euler map, but there are certain issues associated with it as explained earlier. Solutions to these limitations are discussed in this section.

•
Complex-valued mapping: The solution is to use a real-valued mapping instead.

$\displaystyle z(\mathbf{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ (10)
•
Redundancy in storage: This can be remedied by using just the real part of the kernel map.
•
Lack of involvement of attributes: This can be solved by applying a linear map ( $\mathbf{C}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{r}$ ) as follows.

$\displaystyle z(\mathbf{x})=\mathbf{C}\mathbf{x}$ (11)

where $\mathbf{C}\in\mathbb{R}^{r\times d}$. Each weight vector $\mathbf{c}^{j}\in\mathbb{R}^{d}$ (with $\mathbf{C}=[\mathbf{c}^{1},\mathbf{c}^{2},\ldots,\mathbf{c}^{r}]^{T}$) supplies the coefficients of one linear combination of the attributes. One obvious choice for this weight vector is a random vector, but this may not always work well. Another way is to set the weights by applying some learning mechanism to the data, but this is not possible in the streaming setting. Instead, a random matrix $\mathbf{C}\in\mathbb{R}^{r\times d}$ whose elements are in $\{0,1\}$ can be used. Its entries are drawn from the Bernoulli distribution. The matrix $\mathbf{C}$ transforms the $d$ dimensional vector $\mathbf{x}$ to a vector in the $r$ dimensional space. Besides, because some entries of $\mathbf{C}$ are 0, not all attributes participate in each component of the transformation. It is shown later (Section 4) that this setting is beneficial.

The second issue is the lack of effectiveness of FD in producing good PCs. The construction of accurate sketches is crucial in discovering good PCs. FD is a less accurate sketching method than $\beta$ FD. Thus, using $\beta$ FD for the sketching part can result in improved PC discovery. The issues of the previous kernel based mapping algorithms can be overcome by using the ECM and Euler kernel maps, which are described below.

2.4 ECM

As discussed by [11], ECM can overcome the limitations of RFF. It takes the following form.

$\displaystyle\bm{z}(\mathbf{x}^{i})=\left[\begin{array}{c}\cos(\alpha\pi\mathbf{c}^{1,T}\mathbf{x}^{i})\\ \cos(\alpha\pi\mathbf{c}^{2,T}\mathbf{x}^{i})\\ \ldots\\ \cos(\alpha\pi\mathbf{c}^{r,T}\mathbf{x}^{i})\\ \end{array}\right]$ (12)

Here $\mathbf{C}\in\mathbb{R}^{r\times d}$ , $\mathbf{C}=[\mathbf{c}^{1},\ldots,\mathbf{c}^{r}]^{T}$ , $\mathbf{c}^{j}\sim\textit{Bernoulli}$ and $\alpha\in[0,2]$ .

The time taken to apply ECM to a $d$ -dimensional input is $O(rd)$ . In Section 4, it is shown that for $r<m$ , better learning efficiency can be achieved.
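The ECM of Eq. (12) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the helper name `make_ecm` is hypothetical, and the Bernoulli success probability `p=0.5` is an assumption, since the parameter of the distribution is not given in this section.

```python
import numpy as np


def make_ecm(d, r, alpha=1.0, p=0.5, seed=0):
    """Build an ECM map z: R^d -> R^r with a random 0/1 matrix C (assumed Bernoulli(p))."""
    rng = np.random.default_rng(seed)
    C = rng.binomial(1, p, size=(r, d)).astype(float)  # rows play the role of c^1..c^r

    def z(X):
        X = np.atleast_2d(X)                           # accept one point or a batch of points
        return np.cos(alpha * np.pi * X @ C.T)         # entry j is cos(alpha*pi*c^{j,T} x)

    return z, C


z, C = make_ecm(d=8, r=5, alpha=0.5)
x = np.linspace(0.0, 1.0, 8)
print(z(x).shape)  # one r-dimensional real-valued row per input point
```

The output is real-valued and has $r$ entries, and each entry only involves the attributes selected by the non-zero entries of the corresponding row of $\mathbf{C}$.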

2.5 Euler kernel map

The Euler kernel mapping is done as follows. The time required to apply the map to a $d$ dimensional vector is $O(rd)$ .

$\displaystyle\mathbf{z}^{(i)}=\left[\begin{array}{c}e^{(\alpha\pi\mathbf{c}^{T}_{1}\mathbf{x}^{(i)})}\\ e^{(\alpha\pi\mathbf{c}^{T}_{2}\mathbf{x}^{(i)})}\\ \ldots\\ e^{(\alpha\pi\mathbf{c}^{T}_{r}\mathbf{x}^{(i)})}\\ \end{array}\right]$ (13)

3. Proposed methods

In this section, the proposed algorithms, ECM-SKPCA and Euler-SKPCA are explained in detail.

3.1 ECM-SKPCA

The steps are given in Algorithm 3.1. First, the matrix $C$ is constructed by drawing entries from $\{0,1\}$ . The ECM kernel map of Eq. (12) is applied to the data points in the data matrix. This is done via the procedure ApplyFeatureMap(). The PCs ( $\mathbf{V}$ ) are discovered from the mapped points $\mathbf{z}^{i},i=1,..,n$ by using the $\beta$ FD algorithm. The modifiedReduceRank() procedure returns the matrix $\bm{\Sigma}^{\prime}$ resulting from the application of the following function (Eq. (14)).

$\displaystyle\bm{\Sigma}^{\prime}=\textit{diag}\left(\left[\sqrt{\bm{\Sigma}_{1,1}^{2}-\text{\emph{attenuate}}(\beta,0,l)\delta},\ldots,\sqrt{\bm{\Sigma}_{l,l}^{2}-\text{\emph{attenuate}}(\beta,l-1,l)\delta}\right]\right)$ (14)

where $l=\text{size}(\bm{\Sigma})$ and $\delta=\bm{\Sigma}_{l,l}^{2}$ .

Algorithm 3.1: ECM-SKPCA

Input: $\mathbf{X}\in\mathbb{R}^{n\times d}$, $\beta\in(0,\infty)$, $\alpha\in[0,2]$, $\mathbf{C}\in\mathbb{R}^{r\times d}$, $l>0$.

Output: $\mathbf{V}\in\mathbb{R}^{r\times l}$

1. $\mathbf{B}\leftarrow 0^{l\times r}$
2. for $i\in[1,\ldots,n]$ do
3. $\mathbf{z}^{(i)}\leftarrow\textit{ApplyFeatureMap}(\mathbf{x}^{(i)},\alpha,\mathbf{C})$
4. $\mathbf{B}\leftarrow\mathbf{z}^{(i)}$ (insert $\mathbf{z}^{(i)}$ into a zero valued row of $\mathbf{B}$)
5. if $\mathbf{B}$ has no zero valued rows then
6. $\left[{\mathbf{U},\bm{\Sigma},\mathbf{V}^{T}}\right]=\text{SVD}(\mathbf{B})$
7. $\delta=\bm{\Sigma}_{l,l}^{2}$
8. $\bm{\Sigma}^{\prime}=$ modifiedReduceRank( $\bm{\Sigma},\beta$ )
9. $\mathbf{B}\leftarrow\bm{\Sigma}^{\prime}\mathbf{V}^{T}$
10. end if
11. end for
12. return $\mathbf{V}$
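The streaming loop of Algorithm 3.1 can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation. The attenuate function of Eq. (14) is not specified in this section, so a hypothetical placeholder `fd_attenuate` that always returns 1 is used; with this placeholder the shrinkage step reduces to that of plain FD rather than $\beta$ FD, and `beta` has no effect. Any replacement must return 1 for the last index so that a row of the sketch is freed after each shrinkage. The function name `ecm_skpca` is also illustrative.

```python
import numpy as np


def fd_attenuate(beta, i, l):
    """Hypothetical placeholder weight: constant 1 gives plain FD shrinkage, not beta-FD."""
    return 1.0


def ecm_skpca(X, C, alpha, beta, l, attenuate=fd_attenuate):
    """Stream the rows of X through the ECM map and maintain an l x r sketch B."""
    r = C.shape[0]
    if not 0 < l <= r:
        raise ValueError("sketch size l must satisfy 0 < l <= r")
    B = np.zeros((l, r))
    Vt = np.zeros((l, r))  # stays zero if fewer than l points are seen
    weights = np.array([attenuate(beta, i, l) for i in range(l)], dtype=float)
    for x in X:
        z = np.cos(alpha * np.pi * (C @ x))             # ApplyFeatureMap, Eq. (12)
        zero_rows = np.flatnonzero(~B.any(axis=1))
        B[zero_rows[0]] = z                             # insert into a zero valued row
        if zero_rows.size == 1:                         # that was the last free row
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            sq = s ** 2
            delta = sq[-1]                              # delta = Sigma_{l,l}^2
            s_shrunk = np.sqrt(np.maximum(sq - weights * delta, 0.0))  # Eq. (14)
            B = s_shrunk[:, None] * Vt                  # B <- Sigma' V^T
    return Vt.T                                         # V from the most recent SVD


rng = np.random.default_rng(0)
X_demo = rng.random((200, 6))
C_demo = rng.binomial(1, 0.5, size=(12, 6)).astype(float)
V = ecm_skpca(X_demo, C_demo, alpha=1.0, beta=1.0, l=4)
print(V.shape)  # r x l matrix of principal directions
```

Only the $l\times r$ sketch is kept in memory, so the storage cost does not depend on the number of points $n$.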

3.2 Euler-SKPCA

The Euler kernel map can also be used to map data points to the kernel space. The SKPCA algorithm with the Euler kernel map is referred to as Euler-SKPCA. Here the ApplyFeatureMap() procedure applies the Euler kernel map of Eq. (13) to the data points. The rest of the steps are the same as in Algorithm 3.1.

4. Experimental evaluation

In this section, the effectiveness of the proposed kernel based sketching algorithms as feature extraction techniques is demonstrated. First, the features are extracted using the algorithms RFF-SKPCA, Euler-SKPCA and ECM-SKPCA. Then, the extracted features are given to a classification algorithm. The classifier used is a Support Vector Machine (SVM) with a linear kernel. The proposed algorithms are also compared with features extracted by Incremental PCA (IPCA) and Sparse PCA (SPCA), each followed by a linear classifier.

4.1 Setup

All experiments were carried out in a Linux machine with 1.7GHz Intel Core i3 processor and 4GB of RAM. The code was written in Python. The matrix sketching algorithms, $\beta$ FD (for ECM-SKPCA and Euler-SKPCA) and FD (for RFF-SKPCA) have a parameter, which is the size of the sketch $l$ . In all the experiments, for COD-RNA, $l=5$ , and for HTTP and PPMI $l=10$ . The sketch size ( $l$ ) for all the other datasets is 100.

4.2 Datasets used and their characteristics

The datasets used for the evaluation of the algorithms are described below, and their characteristics are given in Table 1.

Table 1
Datasets and their characteristics

Dataset	# Instances	# Attributes	# Classes
COD-RNA	488565	8	2
PPMI	3684	34	2
HTTP	200000	41	2
Spam	9324	499	2
Usenet	5995	500	2
ISOLET	480	617	2
HAR	10299	561	6
MNIST	70000	784	10
PEMS-SF	63360	963	7

•

Coding-Ribonucleic acid (COD-RNA) [20]: It contains 488565 genome sequences, each having 8 attributes. The number of class 0 instances and class 1 are 325710 and 162855 respectively.

•

PPMI [21]: This dataset contains 3684 patient records, out of which 1333 belong to class 0 and 2351 belong to class 1. The number of attributes in this dataset is 34. The dataset can be downloaded from https://www.ppmi-info.org/access-data-specimens/download-data/.

•

HTTP: This dataset contains 200000 instances, out of which 160555 belong to class 0 and 39445 belong to class 1. The number of attributes in the dataset is 41. The dataset can be obtained from UCI repository [22].

•

Spam: This dataset contains spam as well as normal (ham) messages. It can be downloaded from http://mlkd.csd.auth.gr/concept-drift.html. The first 499 attributes were used in the experiments. The total number of instances is 9324, in which 6937 belong to class 0 and 2387 belong to class 1.

•

Usenet: This dataset contains email messages that belong to either of 2 classes. It can be downloaded from http://mlkd.csd.auth.gr/concept_drift.html. In the experiments, the total number of instances used is 5995 and the first 500 attributes were used. Class 0 contains 2997 instances and class 1 contains 2998 instances.

•

ISOLET [23]: This dataset contains spoken English recordings of 30 speakers. There are 26 classes in this dataset, with a total of 7797 instances. Data belonging to classes 1 and 15 were used in this work. Classes 1 and 15 contain 240 instances each. The size of the dataset used is $480\times 617$.

•

HAR [24]: It contains data belonging to 6 classes with a total of 10299 instances. Size of the dataset is $10299\times 561$ .

•

MNIST [25]: It contains data belonging to 10 classes and the total number of instances is 70000. Data belonging to classes 1 to 10 were used in this work. The size of the dataset used is $70000\times 784$ .

•

PeMS SF [26]: This dataset contains 963 columns and 63360 rows.

4.3 Results

There are two kinds of results that are considered: Classification accuracy and the time taken to run the algorithms.

4.3.1 (a) Classification accuracy

Figure 1.

Comparison of accuracy for various values of $m/r$ .

After the feature extraction is done on the datasets, the classification accuracy on the test set is recorded. Each feature extraction algorithm has a parameter $r$ (for ECM-SKPCA and Euler-SKPCA) or $m$ (for RFF-SKPCA), and by varying this quantity, the resulting classification accuracy can be observed. Figure 1 shows the variation of accuracy of all the algorithms for all datasets.

For large as well as small datasets, the combination of the proposed algorithm ECM-SKPCA and SVM obtains the best accuracy for all values of $r$ (Fig. 1a–i). This indicates that the improved matrix sketching part of the proposed algorithm is effective in capturing the relevant features needed to discriminate between the classes in the dataset. The accuracy of Euler-SKPCA is slightly worse than that of ECM-SKPCA, but is in most cases better than that of RFF-SKPCA.

RFF-SKPCA obtains lower accuracy than ECM-SKPCA on all datasets, for all values of $m$. It is also observed to be worse than Euler-SKPCA for most datasets, except MNIST. In ISOLET (Fig. 1f), RFF-SKPCA $+$ SVM obtains lower accuracy than ECM-SKPCA $+$ SVM and Euler-SKPCA $+$ SVM for all values of $m$ except $m=300$. For the multi-class classification of the HAR dataset (Fig. 1g), Euler-SKPCA $+$ SVM obtains better accuracy than RFF-SKPCA $+$ SVM. In the MNIST dataset (Fig. 1h), Euler-SKPCA $+$ SVM obtains better accuracy than RFF-SKPCA $+$ SVM for $m=$ 100, 150, but obtains lower accuracy for $m>200$. Fig. 1i shows that Euler-SKPCA $+$ SVM obtains higher accuracy than RFF-SKPCA $+$ SVM for the PEMS-SF dataset.

4.3.2 (b) Time taken

Figure 2.

Variation of feature extraction time with parameter ( $m/r$ ).

Table 2

Summary of best accuracy obtained for each dataset

Dataset	Algorithm	Test accuracy
COD-RNA	RFF-SKPCA	0.87
	IPCA	0.83
	SPCA	0.31
	Euler-SKPCA	0.87
	ECM-SKPCA	0.88
PPMI	RFF-SKPCA	0.77
	IPCA	0.76
	SPCA	0.77
	Euler-SKPCA	0.78
	ECM-SKPCA	0.79
HTTP	RFF-SKPCA	0.98
	IPCA	0.96
	SPCA	0.94
	Euler-SKPCA	0.98
	ECM-SKPCA	0.99
Spam	RFF-SKPCA	0.94
	IPCA	0.81
	SPCA	0.81
	Euler-SKPCA	0.95
	ECM-SKPCA	0.97
Usenet	RFF-SKPCA	0.51
	IPCA	0.56
	SPCA	0.54
	Euler-SKPCA	0.55
	ECM-SKPCA	0.56
ISOLET	RFF-SKPCA	0.96
	IPCA	0.92
	SPCA	0.95
	Euler-SKPCA	0.98
	ECM-SKPCA	0.98
HAR	RFF-SKPCA	0.92
	IPCA	0.62
	SPCA	0.64
	Euler-SKPCA	0.92
	ECM-SKPCA	0.94
MNIST	RFF-SKPCA	0.87
	IPCA	0.31
	SPCA	0.11
	Euler-SKPCA	0.86
	ECM-SKPCA	0.89
PEMS-SF	RFF-SKPCA	0.53
	IPCA	0.47
	SPCA	0.51
	Euler-SKPCA	0.56
	ECM-SKPCA	0.57

In this section, the time taken to extract the features is measured for each value of the parameter $m$ or $r$. From Fig. 2, it can be observed that ECM-SKPCA is the fastest algorithm, regardless of the magnitude of $r$. In all plots, the time taken by the algorithms increases as the value of the parameter $m/r$ increases. ECM-SKPCA takes the least amount of time for all values of $r$, while Euler-SKPCA takes the most time for all datasets. The difference in timing between RFF-SKPCA and ECM-SKPCA is clear in datasets with and without a large number of attributes ( $d$ ). This supports our intuition that the explicit cosine map, when combined with the improved matrix sketching algorithm ( $\beta$ FD), provides a fast dimensionality reduction algorithm.

4.4 Summary and results analysis

In the previous section, it was confirmed that ECM-SKPCA is indeed the fastest algorithm. In the task of scalable dimensionality reduction using approximation schemes, there is always a trade-off between the time to run and the performance of the algorithm (in terms of error or accuracy). In this work, an approach to achieve a scalable dimensionality reduction method has been demonstrated to be effective in terms of classification accuracy, while being faster than a previous related approach. The summary of best classification accuracy obtained by all the algorithms and corresponding values of $m/r$ are given in Table 2. The experimental results show that the proposed algorithm overcomes the drawbacks of the previous algorithms. In summary, the following conclusions can be drawn from the experimental results.

•
Good overall classification accuracy: In all datasets, ECM-SKPCA + SVM obtains the best classification accuracy. This is confirmed by the results in Fig. 1 and Table 2. Euler-SKPCA $+$ SVM also achieves better accuracy than RFF-SKPCA $+$ SVM in many datasets. This confirms the idea that improving the matrix sketching part and the kernel approximation can result in a scalable yet accurate feature extraction algorithm. Moreover, the results obtained were consistent on datasets of varying sizes and complexity (in terms of tasks associated with it), which is indicative of the wide range of applicability of the proposed algorithms.
•
Better classification accuracy time trade-off: The experimental results indicate that good classification accuracy is obtained by ECM-SKPCA + SVM even for small values of $r$. This implies that the proposed algorithm, ECM-SKPCA, can run much faster than the other algorithms. This is confirmed by the results in Figs 1 and 2. Thus, the aim of proposing a more accurate and faster algorithm than RFF-SKPCA has been achieved.
•
Faster: The proposed algorithm, ECM-SKPCA is clearly the fastest algorithm, while Euler-SKPCA is the slowest. This is confirmed by the results in Fig. 2. This speedup has been achieved as a result of using a faster kernel map.

5. Conclusion

Kernel based DR algorithms are useful in extracting meaningful features from complex datasets. An existing algorithm, RFF-SKPCA, combines the power of kernels and a matrix sketching algorithm, FD, for performing DR. This algorithm requires the parameter $m$ of RFF to be large in order to achieve good accuracy. This in turn increases the running time of the algorithm. In order to achieve faster and more accurate kernel based DR, two kernel based sketching algorithms are proposed in this work. The first algorithm, ECM-SKPCA, uses the fast and accurate explicit kernel map ECM and $\beta$ FD for extracting the features. The ECM-SKPCA algorithm is shown to be faster and more accurate in terms of classification accuracy when compared to RFF-SKPCA. For all datasets, the classification accuracy of ECM-SKPCA $+$ SVM is found to be the best. The second algorithm, Euler-SKPCA, is slower than ECM-SKPCA, but it is shown to be more accurate than RFF-SKPCA on many datasets. The experimental results demonstrate the effectiveness of the proposed algorithms as DR techniques.

Acknowledgments

This publication is an outcome of the R&D work undertaken in the project under the Visvesvaraya PhD Scheme of the Ministry of Electronics and Information Technology, Government of India, being implemented by Digital India Corporation.

