Irregular features disrupt the desired classification. In this paper, we consider aggressively modifying scales of features in the original space according to the label information to form well-separated clusters in low-dimensional space. The proposed method exploits spectral clustering to derive scaling factors that are used to modify the features. Specifically, we reformulate the Laplacian eigenproblem of the spectral clustering as an eigenproblem of a linear matrix pencil whose eigenvector has the scaling factors. Numerical experiments show that the proposed method outperforms well-established supervised dimensionality reduction methods for toy problems with more samples than features and real-world problems with more features than samples.
Dimensionality reduction is a technique for reducing the number of variables of data samples and has been successfully applied in many fields to make machine learning algorithms faster and more accurate, including the pathological diagnoses of gene expression data [28], the analysis of chemical sensor data [17], the community detection in social networks [29], the analyses of neural spike sorting [1], and others [24]. Due to their dependence on label information, dimensionality reduction methods can be divided into supervised and unsupervised methods. Typical unsupervised dimensionality reduction methods are the principal component analysis (PCA) [13, 16], the classical multidimensional scaling (MDS) [4], the locality preserving projections (LPP) [12], and the t-distributed stochastic neighbor embedding (t-SNE) [30].
To make use of prior knowledge on the labels, we focus on supervised dimensionality reduction methods. Supervised dimensionality reduction methods map data samples into an optimal low-dimensional space for satisfactory classification while incorporating the label information. One of the most popular supervised dimensionality reduction methods is the linear discriminant analysis (LDA) [3], which maximizes the between-class scatter and reduces the within-class scatter in a low-dimensional space. However, LDA is based on the assumption that the input data obeys a Gaussian distribution and may fail to capture the underlying local data structure. To overcome its drawback, variants of LDA have been proposed. Bressan and Vitrià [5] proposed a variant that preserves the local data structure by redefining the between-class scatter and within-class scatter matrices based on the -nearest neighbors of each data sample. Weinberger and Saul [32] considered penalizing large distances among data samples with the same label and the -nearest neighbors and small distances among data samples with different labels according to a cost function. Sugiyama proposed the local Fisher discriminant analysis (LFDA) [26], which maximizes the between-class separability and preserves the within-class local structure, incorporating LPP for supervised dimensionality reduction. Cai et al. [7] proposed another variant, which captures the local geometrical data structure to map -nearest neighbors in the same classes and those in different classes to a low-dimensional space as close and as distant as possible, respectively. Instead of preserving the local data structure based on the distances among data samples in the original data space, Li et al. [18] introduced a weight between a pair of samples in the same class and exploited the local manifold structure of data samples in a low-dimensional space. In the context of recent metric learning, Mu [21] proposed supervised low-rank metric learning for an optional Mahalanobis distance (fixed-rank metric learning (FRML)), and Harandi et al. [10] proposed joint learning for dimensionality reduction mapping and metric in the reduced space (dimensionality reduction metric learning (DRML)).
In addition to capturing the local structure, as in LDA and its variants, we consider aggressively modifying the scales of the features in the original space before mapping the data samples to a low-dimensional space. This results in reducing the effect of features that interfere in the desirable classification as well as enhancing the separations of different clusters. For these purposes, we derive quantities called feature scaling factors, which are used to scale features based on spectral clustering (SC) for dimensionality reduction [19]. SC projects given data samples to a low-dimensional space formed by eigenvectors of a similarity or Laplacian matrix. In [35, 34], it was reported that SC is effective in detecting cluster structures, and in [33], it was reported that SC preserves the local data structure for feature selection. These reports motivated us to also use SC to learn local cluster structures.
Our purpose is to scale features of given data samples and form (well-separated) clusters in a lower dimensional space. Previous spectral feature scaling methods aimed at binary classification [19]: in this paper, we extend the scope of classification problems to handle multiclassing. Multiclass classification problems with more than two classes have typically been solved by combining independently produced binary classification problems [14]. We take an approach different from [14] to enable a multiclass classification. The contributions of this study can be summarized as follows:
Extension to multiclass classification
We extend the scope of problems that the spectral feature scaling method can handle to multiclass classification. Previous formulations allow only binary classification and require repeated applications of the method for multiclass classification.
Automatic parameter tuning
We propose an automatic tuning technique to determine the values of parameters that specify entries of the (Fielder) eigenvectors according to the labels of training data samples. The extension of the spectral feature scaling method to multiclass classification naively increases the number of hyperparameters and the computational cost. Our technique can save this cost in the multiclass case.
The remainder of this paper is organized as follows. In Section 2, we present preliminaries. In Section 3, we present the proposed spectral feature scaling method. In Section 4, we describe the difference between the proposed method and previous methods. In Section 5, we present experimental results comparing the proposed method with previous methods. In Section 6, we conclude the paper. Throughout the paper, we use italic bold characters to represent column vectors.
Preliminaries
Conventional spectral clustering nonlinearly reduces the dimensionality of data samples but may not form the desired data sample clusters due to irregularity of the data features. To improve the classification of reduced data samples, we may modify the scales of features. Simple feature scaling techniques, such as centering, standardization, and normalization, serve as preprocessing for the support vector machine and -means algorithm. This motivated us to propose supervised feature scaling that enhances the separation of reduced data samples based on prior knowledge from training data.
The spectral clustering partitions a weighted undirected graph in which the edge weight between nodes and is defined according to the similarity between data samples and , where and is the number of data samples. We use the locally scaled Gaussian kernel function [36] to define the similarity between data samples and as follows:
where , and denotes the local scaling of data samples around . Thus, we convert a given set of data samples to a graph according to similarities or distances among data samples. A similarity graph is given by a model of the local relationship between data samples. We set , where is the th nearest neighbor of . Zelnik-Manor and Perona [36] recommended the value for in spectral clustering in practice, and we adopted this choice.
Spectral clustering partitions a graph into subgraphs that are contradictory to each other and forms clusters from the corresponding data samples. Let be the similarity matrix,
be a vector of all ones, and be an indicator such that when data sample belongs to a cluster and when data sample belongs to another cluster. Then, the graph partitioning is given by the constrained minimization problem of the normalized cut (Ncut) function [23]:
The entry takes two discrete values to devide the set of data samples into two clusters. By the continuous relaxation to , Eq. (2) is reduced to finding the Fiedler vector [9] corresponding to the minimum eigenvalue of the constrained generalized eigenvalue problem
if the minimum eigenvalue has multiplicity one, where is the Laplacian matrix. The entries of the Fiedler vector are referred to as the coordinates of the data samples in the reduced space. In practice, we use more eigenvectors corresponding to the minimum eigenvalues of Eq. (4) to form a larger space for accurate classification. When clustering well-separated data samples, we will find hyperplanes that separate the data samples with reduced dimensions formed by a few eigenvectors in Eq. (4). Here, if in Eq. (1), i.e., is the identity matrix, Eq. (4) is reduced to a conventional spectral clustering method with relaxed Ncut [23]. If , Eq. (4) is a generalization of the method [23] and we can calculate the diagonal entries of , which are called the scaling factors, for each data feature for binary classification problems [19].
Proposed method
In this section, we formulate the spectral feature scaling method (SFS) for dimensionality reduction. The method was originally proposed for binary clustering and classification, but it is here extended to multiclass classification. The proposed method does not use typical means of extension to multiclass classification, such as multiclass support vector machines [14]. The new formulation of SFS computes scaling factors for each feature and merges them into a single scaling factor. Here, the parameter is the number of combinations that divide the set of data samples into two classes: we will discuss the selection of in Section 3.3.
If the labels of the training data samples are known, we use discrete variable in Eq. (2) to prescribe the entries of the eigenvectors of the Laplacian eigenvalue problem
where , , , , and depend on the feature scaling factors , and is a parameter: we will discuss how to estimate the values of in Section 3.3. The entries and are analogously defined as Eq. (2). We form the scaling matrix from and will discuss the detail in Section 3.2. Then, we apply to data , which consists of the training data and the test data as follows:
With the feature scaling , we enhance the separation of the reduced data samples as desired in the low-dimensional space formed by spectral clustering. This will improve the classification results. Finally, we classify the scaled data using eigenvectors corresponding to the positive minimum eigenvalue of the Laplacian eigenvalue problem
where , ,
is the th column of , , , and . The values of parameters and can be determined by cross-validation.
Formulation
Now, we reformulate Eq. (5) as another eigenvalue problem to extract the scaling factors as an eigenvector. Denote the scaling matrix by
and the entry of the similarity matrix of scaled data by
where , , and the th entry of is [22]. Here, we used the first-order approximation of the exponential function for . Then, the th row of is
where is an -dimensional vector with zero in the th entry and ones in all other entries, and
Furthermore, the feature scaling factors have the constraint
Combining Eqs (9)–(3.1), we can obtain the generalized eigenvalue problem of a linear matrix pencil
where
and we solve Eq. (13) for the scaling factors for each . If the eigenproblems Eq. (13) have a common eigenpair, they can be combined as
where . The existence of an eigenvalue of Eq. (13) is assured by the equality . Because the Fiedler vector is associated with the smallest eigenvalue, we adopt the eigenvector of Eqs (13) and (17) associated with the eigenvalues and closest to but less than one as a candidate feature scaling factors, respectively. Here, the eigenproblem Eq. (17) may not have the same eigenpair, so we find an eigenpair of Eqs (14) and (15) as close as possible in the eigenproblems. To solve Eqs (13) or (17), we can use the contour integral method [20] for ; otherwise, we can also use the minimal perturbation approach [15].
Integrating candidates of feature scaling factors
Here, we describe how to determine the scaling factors from the candidates in the multiclass case. The scaling factors given by solving Eq. (13) are used to form a single feature scaling matrix . We have the following five approaches to computing the diagonal elements of :
We can obtain by using the principal component, i.e., the eigenvector associated with the maximum eigenvalue of the eigenproblem
where is th row of the matrix
and . Thus, is obtained.
Arithmetic mean .
Geometric mean .
Root mean square .
Harmonic mean .
We will compare the classification accuracy of each approach for integrating scaling factors in the numerical experiments.
Number of prescribed eigenvectors and prescription of the entries
Consider determining a reasonable number of prescribed eigenvectors to prevent solving many eigenvalue problems Eq. (13) or a large eigenvalue problem Eq. (17). Let be the number of classes. In particular, if , we prescribe only one eigenvector in Eq. (5), as in [19]. This is because one eigenvector is sufficient for binary classification if the clusters are well-separated [8]. If , we split the set of training data samples into combinations of binary classes with two options: take or take at least and determine the entries of the eigenvector as
for according to the label information of the training data, where is a parameter we discuss how to determine later in this section. Remark that the -dimensional indicator vector for sample is uniquely chosen to a label. This is a solution of discrete optimization for a binary clustering Eq. (2) reduced from multiclass classification. That is, one cluster consists of data samples belonging to several classes, and the other clusters consist of data samples belonging to other classes. The classification accuracy varies depending on the combination of such binary classifications. By setting , the computational cost for solving Eqs (13) or (17) is large, while the classification accuracy tends to be less sensitive to the choice of combinations. When setting as small, the classification accuracy tends to be sensitive and the computational cost for solving Eqs (13) or (17) is small. Hence, there is a trade-off between the cost and sensitivity of classification accuracy for these approaches.
In the previous SFS with in Eq. (18), we empirically determined the value of [19]. In this study, we found that the values of in Eq. (18) determined by
according to the label information of training data give favorable results, where indicates that data sample belongs to a cluster, and indicates that data sample belongs to another cluster. The choice of or is flexible in Eq. (19) and can be fixed by cross-validation in practice (see Section 5). Note that the ideal Fiedler vector does not necessarily take such an entry and is not necessarily unique.
We summarize the procedures of the proposed method in Algorithm 3.3.
[!h] Supervised dimensionality reduction method with spectral features scaling[1] Training data with samples and features, test data , number of reduced dimensions . Low-dimensional data samples . Decide the number of eigenvectors to prescribe . Compute the parameter using and set the eigenvectors . Compute , and using Eqs (10), (11), and (16). Solve Eq. (13) for an eigenvector . Compute the scaling matrix from the scaling factors (Section 3.3). Compute the scaled data matrix Eq. (6). Compute the similarity matrix Eq. (7) and . Compute the Laplacian matrix . Solve for the eigenvectors corresponding to the smallest nonzero eigenvalues.
Related methods
In this section, we discuss the difference between the proposed method and previous methods. Many supervised dimensionality reduction methods, such as LDA and its variants, utilize knowledge from labels and map given data samples to an optimal low-dimensional space. The objective of these methods is to find a transformation matrix to map given data samples to low-dimensional data samples
whereas the objective of our method is to generate a scaling matrix and modify the scales of the features of given data samples as and then map them to low-dimensional data samples using a transformation matrix :
where is a transformation with a nonlinear kernel function. The methods in [18, 31] modify the weights among data samples, whereas the proposed method modifies the scales of features in the original space and uses the same feature scaling factors for all data samples.
Numerical experiments
We compare the proposed method (SFS) with previous supervised dimensionality reduction methods and metric learning method in terms of accuracy by numerical experiments on artificial data and practical datasets. We show that the proposed method outperforms previous methods in some cases, and is more robust than the previous methods. The previous methods were kernel versions of LDA (KDA) [2, 6] and LFDA (KLFDA) [27], LADA [18] and FRML [21]. The performance measures used for comparisons were overall accuracy (OA), average accuracy (AA), and normalized mutual information (NMI) [%] [25]. Here, OA and AA are definedas
where is the number of samples in class , is the number of samples classified by a method to class , and is the number of classes. We performed five-fold cross-validation and show the mean and standard deviation of each performance measure. The reduced dimension and the constraint rank for FRML were chosen by four-fold cross-validation regarding OA between one to the number of features when the number of features is smaller than the number of training data samples and between one to the number of training data samples, otherwise. In SFS, we used the seven nearest neighbors to sparsify the matrix . We chose optional values of the regularization parameters of FRML among 1, 10, 10, 10, and 10 according to cross-validation and, and of LADA [31] among 1, 10, 10, and 10 according to cross-validation. In SFS, we chose the value of or to which class is set when using the number of classes as the number of reduced dimensions (Section 3.3), chose the number of reduced dimension , and chose the value of again. We used the KLFDA code from the Sugiyama-Sato-Honda Lab site1
http://www.ms.k.u-tokyo.ac.jp/.
and programmed the LADA and KDA codes. All programs were coded and run in MATLAB 2018a.
Visualization of the artificial data in three dimensions.
Mean of each performance measure vs. variance .
Artificial datasets
In this subsection, we report results on a toy problem consisting of 600 data samples with ten features. The data samples had three classes, and each class had 200 samples. Figure 1 shows three of the ten features, where the symbols , , and represent data samples in classes 1, 2, and 3, respectively, and the data samples in each class are normally distributed along a ring. The rings of classes 1 and 2, and those of classes 2 and 3 intersect, respectively. The variances of the first to third features are 0.35, 2.01, and 0.19, respectively. The remaining seven features are normally distributed with center zero and variance . We changed the variance from 1 to 25 in increments of 1. These seven features interfere with data sample classification according to the first three features. The classification of data samples with the reduced dimension was done by logistic regression. Results for other classifiers are given in the Appendix. In SFS, the number of prescribed eigenvectors was set to the number of classes.
Test data samples for in the original and reduced two-dimensional space.
Test data samples for in the original and reduced two-dimensional space.
Values of the scaling factors for (a) and (b) when using the root mean square.
Specifications of datasets
# of samples
# of features
# of classes
(a) Colon
62
2000
2
(b) CLL_SUB_111
111
11340
3
(c) Lung
203
3312
5
(d) Lymphoma
96
4026
9
(e) WarpPIE10P
210
2420
10
Performance rates (mean% std) for real-world datasets
OA
AA
NMI
(a) Colon
SFS (RMS)
81.7
81.3
35.1
LADA
63.3
57.5
12.5
KLFDA
83.3
78.8
36.7
KDA
66.7
50.0
NaN
FRML
71.7
65.0
15.3
(b) CLL_SUB_111
SFS (RMS)
68.2
72.7
37.3
LADA
53.6
38.0
11.2
KLFDA
67.3
70.7
34.2
KDA
50.0
35.3
15.9
FRML
45.5
57.3
31.2
(c) Lung
SFS (RMS)
94.5
85.4
85.7
LADA
93.5
79.4
79.8
KLFDA
93.0
82.0
78.7
KDA
64.5
20.4
8.5
FRML
91.0
76.3
75.3
(d) Lymphoma
SFS (RMS)
94.8
94.8
94.6
LADA
60.0
42.8
62.7
KLFDA
90.5
80.0
90.7
KDA
47.4
11.1
NaN
FRML
62.1
49.8
67.2
(e) WarpPIE10P
SFS (RMS)
88.6
88.9
90.3
LADA
82.9
82.6
83.7
KLFDA
99.5
99.5
99.5
KDA
23.3
8.7
24.1
FRML
34.8
35.2
54.1
Figure 2 shows the performance measures for variances where SFS (one-versus-all) represents the “one-vs-all” approach [11, p. 658] for the previous SFS for a binary classification problem. Regarding the performance measures, some of the proposed methods were more accurate than the previous methods, and KDA was the worst regardless of the variance. SFS (PCA) was the best when the variance was small, while SFS(arithmetic) and SFS (RMS) were the best when the variance was large. SFS (RMS), SFS (arithmetic), and LADA tended to be robust against interfering features. SFS (one-vs-all) was almost as accurate as KDA and FRML, and the line overlaps with the KDA line and FRML line. LADA was more accurate than KLFDA, though LADA is not kernelized. Among the compared SFS methods, the RMS approach was most accurate for many values of . The proposed SFS methods were more accurate than the conventional “one-vs-all” approach for the multiclass classification problem.
Figures 3 and 4 show the first and second features of the original data in parts (a) and the two-dimensional data samples reduced by each method in parts (b)–(d) for and 25, respectively. LADA and KLFDA failed to enhance the separations of clusters, while the proposed method separated clusters well.
Figure 5 shows the absolute values of the obtained scaling factors of SFS (RMS) for and . Because the fourth to tenth features given by random numbers are not involved in the classification, the values of the scaling factors of these seven features were small. Thus, the values of the scaling factors indicate which features are effective in the desired classification.
Real-world datasets
In this section, we report results for five datasets with more features than samples in the real world from a public repository.2
http://featureselection.asu.edu/datasets.php.
Table 1 gives the numbers of samples, features, and classes of the datasets. Datasets (a)–(d) are of gene expressions, and dataset (e) is of face images. In SFS, the number of prescribed eigenvectors was set to the number of classes, and we used RMS to generate the feature scaling matrix. The classification was done by KNN for since it was the best in the experiment in Section 5.1.
Table 2 gives the mean and standard deviation of the performance measures for each method. For the colon and lymphoma, the NMI of KDA was Not-a-Number (NaN) because KDA gave the same label to all test data samples. In all performance measures, SFS (RMS) was the best or second best for all datasets. KLFDA was the best for warpPIE10P.
Conclusions
In this paper, to deal with irregularity or uncertainty in features, we extended previously proposed feature scaling methods to multiclass classification and proposed a supervised dimensionality reduction method that exploits knowledge of labels of training data samples. The proposed method aggressively modifies the scales of features, and feature scaling can reduce the effects of features that prevent us from obtaining the desired clusters. To obtain the factors for scaling the features based on spectral clustering, we derived matrix eigenproblems whose eigenvectors have the feature scaling factors and described the procedures for the proposed method. The multiclass extension asks for more hyperparameters and computational cost. To reduce the number of hyperparameters, we proposed an automatic tuning technique based on the formula for discrete optimization to determine the entries of prescribed eigenvectors. The demanding combinations of entries of prescribed vectors are related to the number of eigenproblems to solve and can be reduced with a trade-off between accuracy and computational cost. Numerical experiments showed that feature scaling is effective when combined with the proposed classification method. For toy problems with more samples than features, feature scaling improved accuracy and was more robust to interfering features. In terms of flexibility in merging candidates of scaling factors for each feature, our experimental results indicate that the RMS approach is preferable in practice. For real-world problems with more features than samples, the proposed method was more robust than previous methods and outperformed previous methods in some cases.
Footnotes
Acknowledgments
The second author was supported in part by JSPS KAKENHI Grant Number 16K17639 and the Hattori Hokokai Foundation. The third author was supported in part by the Japan Science and Technology Agency, ACT-I (No. JPMJPR16U6), New Energy and Industrial Technology Development Organization, and JSPS KAKENHI Grant Numbers 17K12690, 18H03250.
Appendix A. Comparison of classifiers
In Section 5.1, we used logistic regression to classify data samples with reduced dimensions. In fact, the proposed method is not specialized to a particular classifier, which means we can choose an arbitrary classifier after the dimensionality reduction. In this appendix, we observe the dependence of the accuracy on classifiers and compare four classifiers: logistic regression (LR), -nearest neighbor algorithm (KNN), support vector machine (SVM), and naive Bayes classifier (NB). Among the proposed methods, we used SFS (RMS) because it performed the best in many cases.
Figure 6 shows the mean OA for each method and each classifier. SFS (RMS) was more accurate than other methods for every classifier. SFS (RMS) with KNN was more accurate than the other classifiers. The relative positions of the compared methods are the same for all performance measures.
Mean performance measure (OA) vs. classifier.
Figure 7 shows the OA of each method for in the KNN. For , SFS (RMS) was the best and its OA was 93.7%. The accuracies of the compared methods did not largely depend on the value of . The relative positions of the compared methods are also the same for all in the KNN.
Mean of performance measure (OA) vs. in the -NN classifier.
References
1.
Bar-HillelA.SpiroA. and StarkE., Spike sorting: Bayesian clustering of non-stationary data, Journal of Neuroscience Methods157(2) (2006), 303–316.
2.
BaudatG. and AnouarF., Generalized discriminant analysis using a kernel approach, Journal of Neural Computation12(10) (2000), 2385–2404.
3.
BelhumeurP.N.HespanhaJ.P. and KriegmanD.J., Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence19(7) (1997), 711–720.
4.
BorgI. and GroenenP.J.F., Modern Multidimensional Scaling: Theory and Applications, Springer, New York, 1997.
5.
BressanM. and VitriàJ., Nonparametric discriminant analysis and nearest neighbor classification, Pattern Recognition Letters24(15) (2003), 2743–2749.
6.
CaiD.HeX. and HanJ., Efficient kernel discriminant analysis via spectral regression, in: Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 427–432.
7.
CaiD.HeX.ZhouK.HanJ. and BaoH., Locality sensitive discriminant analysis, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2011), 2007, pp. 708–713.
8.
ChungF.R.K., Spectral Graph Theory, American Mathematical Society, Providence, RI, 1996.
9.
FiedlerM., Algebraic connectivity of graphs, Czechoslovak Mathematical Journal23(2) (1973), 298–305.
10.
HarandiM.SalzmannM. and HartleyR., Joint dimensionality reduction and metric learning: A geometric take, in: Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1404–1413.
11.
HastieT.TibshiraniR. and FriedmanJ., The Elements of Statistical Learning, Springer, 2nd edition, New York, 2009.
12.
HeX. and NiyogiP., Locality preserving projections, in: Proceedings of the 16th International Conference on Neural Information Processing Systems, 2003, pp. 153–160.
13.
HotellingH., Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology24(6) (1933), 417–441.
14.
HsuC.W. and LinC.-J., A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks13(2) (2002), 415–425.
15.
ItoS. and MurotaK., An algorithm for the generalized eigenvalue problem for nonsquare matrix pencils by minimal perturbation approach, SIAM Journal on Matrix Analysis and Applications37(1) (2016), 409–419.
16.
JolliffeI.T., Principal component analysis and factor analysis, Principal Component Analysis, 1986, 115–128.
17.
JursP.C.BakkenG.A. and McClellandH.E., Computational methods for the analysis of chemical sensor array data from volatile analytes, Chemical Reviews100(7) (2000), 2649–2678.
18.
LiX.ChenM.NieF. and WangQ., Locality adaptive discriminant analysis, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017, pp. 2201–2207.
19.
MatsudaM.MorikuniK. and SakuraiT., Spectral feature scaling method for supervised dimensionality reduction, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), 2018, pp. 2560–2566.
20.
MorikuniK., Contour integral-type method for eigenproblems of rectangular matrix pencils, in: Book of Abstracts, Annual Meeting 2016 of the Japan Society for Industrial and Applied Mathematics, Kitakyushu, Japan, 2016, pp. 352–353.
21.
MuY., Fixed-rank supervised metric learning on riemannian manifold, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016, pp. 1941–1947.
22.
NgA.Y.JordanM.I. and YairW., On spectral clustering: Analysis and an algorithm, in: Advances in Neural Information Processing Systems 14, 2002, pp. 849–856.
23.
ShiJ. and MalikJ., Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence22(8) (2000), 888–905.
24.
SoghT.S., Wired and wireless intrusion detection system: Classifications, good characteristics and state-of-the-art, Computer Standards & Interfaces28(6) (2006), 670–694.
25.
StrehlA. and GhoshJ., Cluster ensembles – a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research3 (2002), 583–617.
26.
SugiyamaM., Local Fisher discriminant analysis for supervised dimensionality reduction, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 905–912.
27.
SugiyamaM., Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, Journal of Machine Learning Research8 (2007), 1027–1061.
28.
TarcaA.L.RomeroR. and DraghiciS., Analysis of microarray experiments of gene expression profiling, American Journal of Obstetrics and Gynecology195(2) (2006), 373–388.
29.
TichyN.M.TushmanM.L. and FombrunC., Social network analysis for organizations, Academy of Management Review4(4) (1979), 507–519.
30.
van der MaatenL. and HintonG., Visualizing high-dimensional data using t-SNE, Journal of Machine Learning Research9 (2008), 2579–2605.
31.
WangQ.MengZ. and LiX., Locality adaptive discriminant analysis for spectral-spatial classification of hyperspectral images, IEEE Geoscience and Remote Sensing Letters14(11) (2017), 2077–2081.
32.
WeinbergerK.Q. and SaulL.K., Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research10 (2009), 207–244.
33.
YeX.JiK. and SakuraiT., Spectral clustering and discriminant analysis for unsupervised feature selection, in: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2016, pp. 563–568.
34.
YeX.LiH.SakuraiT. and LiuZ., Large scale spectral clustering using sparse representation based on hubness, in: Proceeding of the IEEE International Conference on Cloud and Big Data Computing, 2018, pp. 1731–1737.
35.
YeX. and SakuraiT., Spectral clustering with adaptive similarity measure in Kernel space, Intelligent Data Analysis22(4) (2018), 751–765.
36.
Zelnik-ManorL. and PeronaP., Self-tuning spectral clustering, in: Proceedings of Advances in Neural Information Processing Systems 17, 2005, pp. 1601–1608.