A comparison study on nonlinear dimension reduction methods with kernel variations: Visualization, optimization and classification

Abstract

Because of high dimensionality, correlation among covariates, and noise contained in data, dimension reduction (DR) techniques are often employed before applying machine learning algorithms. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and their kernel variants (KPCA, KLDA) are among the most popular DR methods. Recently, Supervised Kernel Principal Component Analysis (SKPCA) has been shown to be another successful alternative. In this paper, brief reviews of these popular techniques are presented first. We then conduct a comparative performance study based on three simulated datasets, after which the performance of the techniques is evaluated through application to a pattern recognition problem in face image analysis. The gender classification problem is considered on MORPH-II and FG-NET, two popular longitudinal face aging databases. Several feature extraction methods are used, including biologically-inspired features (BIF), local binary patterns (LBP), histogram of oriented gradients (HOG), and the Active Appearance Model (AAM). After applications of DR methods, a linear support vector machine (SVM) is deployed with gender classification accuracy rates exceeding 95% on MORPH-II, competitive with benchmark results. A parallel computational approach is also proposed, attaining faster processing speeds and similar recognition rates on MORPH-II. Our computational approach can be applied to practical gender classification systems and generalized to other face analysis tasks, such as race classification and age prediction.

Keywords

Dimension reduction, PCA, LDA, FDA, KPCA, KFDA, SKPCA, SVM, parameter optimization, gender classification, MORPH-II

1. Introduction

Due to advances in data collection and storage capabilities, the demand has been growing substantially for gaining insights into high-dimensional, complex-structured, and noisy data. Researchers from diverse areas have applied DR techniques to visualize and analyze such data [17, 60]. DR techniques are also helpful to address the issues of collinearity and “ $p\gg n$ ” (i.e., number of features exceeding the sample size in a dataset), by projecting the data into a lower dimensional space with less correlation, so that classical statistical methods can be applied [29]. Principal Component Analysis (PCA) [47, 28] is a well-studied algorithm with the goal of projecting input features onto a lower dimensional subspace while preserving the largest variance possible; lower dimensionality permits easier visualization, for example via heat maps. While PCA is a fully automatic algorithm, DR techniques that account for domain expertise via user input have also been more recently studied [73, 31]. For classification problems, in which the label information as the response variable is available, Linear Discriminant Analysis (LDA) (sometimes referred to as Fisher’s Discriminant Analysis (FDA)) can be used for DR by minimizing intra-class variation and maximizing inter-class variation [16, 51]. Since PCA only utilizes the correlation or covariance matrices, it is considered an unsupervised approach, whereas LDA is considered a supervised approach with labeling information built into its objective function. Despite the dissimilarities, both PCA and LDA search for linear combinations of the features and, therefore, can be applied in linearly separable types of problems [35]. The main challenge is that many problems in practical applications of machine learning are nonlinear [45, 79]. 
For nonlinear DR, kernel methods are popular choices because of their flexibility [59, 43, 70], e.g., Kernel PCA [56], Kernel LDA for two classes [42], and more generalized Kernel LDA for multiple classes [7]. For kernel methods, it is also possible to design specialized kernels based on domain knowledge of a problem [6, 55].

Given the problems in image analysis of high dimensionality and complex correlation structures, DR techniques are often a necessary step [27]. Thus, variants of PCA, LDA, and their kernel extensions have been popular in computer vision with applications of image classification and discrimination [82, 41, 74]. Studies include Eigenfaces [63], Fisherfaces [8], face recognition with KPCA [33], face recognition with Kernel Direct LDA [38], 2D-PCA [75], 2D-LDA [36], among many others. When there are sufficient labeled face images, LDA is experimentally reported to outperform PCA for face recognition [8]. In the case of a small training set, the conclusion could be reversed [41]. Studies comparing classification performance of PCA, LDA, and their kernel variations include [32, 77]. The connections among KLDA, KPCA, and LDA are further discussed in [72]. By incorporating labeling information into the construction of the objective function, Supervised Kernel PCA (SKPCA) [5] has been proposed for visualization, regression, and classification. A modified version of SKPCA for classification problems can be found in [65]. These studies suggest that SKPCA works well in practice among different DR algorithms [15, 53, 68]. Moreover, it has been found in [4] that with bounded kernels, projections from SKPCA are uniformly converging, regardless of the input features’ dimension.

2. Associated work

In recent years, facial demographic analysis has become popular in computer vision, because of its broad applications in human-computer interaction (HCI), security, surveillance, and marketing, which can benefit from the automatic estimation of characteristics like age, gender, and race. Recent surveys on demographic estimation from biometrics are presented in [18, 62]. Specifically, a major task is gender classification, aiming to automatically determine if a person is male or female. Beyond computer vision, the topic has been studied extensively by anthropologists, sociologists, and psychologists. Gender can easily be identified by humans, achieving 96% accuracy in an experiment classifying photographs of adult faces [9]. Automating gender classification has been a priority in real-world applications.

A number of biometrics have been used to identify gender, including face, voice, gait, handwriting, and even the iris [62]. However, gender classification from faces is the most common, probably because photography of faces is non-intrusive and ubiquitous. Ng et al. provide a survey of gender classification via face and gait [44].

Gender classification from faces dates back to 1990, when neural networks were applied directly to pixels from face photographs [19, 12]. Many other early studies utilized the geometric-based approach to represent human faces, relying on measurements of facial landmarks [49, 67]. Though intuitive, such approaches are sensitive to the placement of landmarks, which can only accommodate frontal representations of the face, and may omit some important information from the face (such as texture of the skin). In recent years, the appearance-based methods have been more commonly adopted, which rely on a transformation of an image’s pixels [22, 24, 58]. Such methods capture both the geometric relationships of the face and texture information. However, a drawback is their sensitivity to illumination and viewpoint variations. Other issues are associated with the high dimensionality of the transformed pixels, which will be discussed further in the next paragraph. Some of the most recent gender classification studies involve convolutional neural networks (CNN) [71, 78, 3, 2]. Though CNNs have reached state-of-the-art accuracy rates, they are known to be less interpretable than some other methods.

Pixels often contain high redundancy and noise, which cannot be removed completely by pre-processing steps. Hence, the vectors resulting from appearance-based feature extraction methods inherit that redundancy and noise. Popular image feature extraction methods include local texture techniques such as local binary patterns (LBP) [76, 37, 40, 1], Gabor filters [69], biologically-inspired features (BIF) [21, 26], and histogram of oriented gradients (HOG) [21]. Such methods could lead to a high dimension of extracted features, thwarting practical applications by increasing runtime and memory consumption. When “ $p\gg n$ ”, for which the dimension of the feature space exceeds the sample size of the dataset, a fundamental assumption of many standard statistical procedures is violated. Additionally, collinearity of features can cause numerical problems, while noisy features can obscure true relationships with the response variable and hinder predictive performance. These significant issues motivate the use of DR techniques. The fundamental goal of DR is to extract and retain information in a lower dimensional space. Many of these methods fall under manifold learning, identifying a low-dimensional manifold embedded in a high-dimensional ambient space [39].

Even though PCA and LDA have been widely considered as popular and effective approaches for DR in machine learning, their kernel versions are much less investigated. To the best of our knowledge, KPCA, KLDA, and SKPCA have never before been directly compared on visualization and classification performance through simulations and practical applications to face image analysis problems.

Our main contributions in this study can be summarized as follows. (1) The nonlinear manifold learning projections for KPCA, KLDA, and SKPCA are directly compared with visualization through simulated datasets. (2) Motivated by the nonlinear nature of soft-biometric analysis problems, we utilize KPCA, KLDA, and SKPCA for dimension reduction on four types of appearance-based extracted features (BIF, HOG, LBP, and AAM) for the gender classification task. Moreover, the classification performance is compared systematically on parameter optimization. (3) For applications to practical large-scale systems, we propose an additional parallel computational framework that can decrease runtime while maintaining similar classification rates.

The remainder of the paper is structured as follows. In Section 3, we review the theory of KPCA, SKPCA, and KLDA. In Section 4, we conduct simulation studies to visualize projections. We propose our machine learning methods for gender classification on Morph-II in Section 5. The comparative performance of KPCA, SKPCA, and KLDA on Morph-II is presented and discussed in Section 6. The performance of these DR methods is further compared in Section 7 through application to the FG-NET dataset. The computational framework for large-scale practical systems is proposed in Section 8 and investigated on Morph-II. Finally, we conclude and offer future directions of research in Section 9.

3. Kernel-based dimension reduction methods

The nonlinearity in a classification problem can often be addressed by kernel-based DR methods, with an appropriate choice of kernel. The driving reasons are the nonlinearity of the chosen kernels, the flexibility of tuning parameter selection, and, most importantly, the kernel trick. Mercer’s theorem guarantees that a symmetric positive-definite function can be written as the sum of a convergent sequence of product functions, which implicitly maps the data into a potentially infinite-dimensional space [54]. Thus, it may become feasible to separate the data in the new space. On the other hand, the Representer Theorem shows that the solution for certain kernel methods lies in the finite-dimensional span of the training data [64, 54]. This is very helpful, since we do not need to compute the coordinates of the projected data in the infinite-dimensional space, but only the inner products between all pairs of data in the feature space.

3.1 Notations

With the goal of emphasizing the connections between KPCA, SKPCA, and KLDA, we define the following notations for classification problems.

Let $\mathcal{X}$ be the feature space, a non-empty subset in $\mathbb{R}^{p}$ with $p$ as the number of covariates for each subject. Let $\mathcal{Y}$ be the space for the response variable, a subset in $\mathbb{R}$ . Let $\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}\subset\mathcal{X}\times\mathcal{Y}$ be a series of $n$ independent observations following a joint probability measure $P_{\mathcal{X},\mathcal{Y}}$ . Let $Y=[y_{1},y_{2},\cdots,y_{n}]^{T}$ denote the outcomes of the response variable. Let $X$ be an $n\times p$ feature matrix, with $x_{i}^{T}$ as the $i$ -th row for $i=1,\cdots,n$ , and $x^{(l)}\in\mathbb{R}^{n}$ for $l=1,\cdots,p$ as its $l$ -th column. Thus, the $X$ matrix can be written as:

$\displaystyle X=[x_{1},x_{2},\ldots,x_{n}]^{T}=[x^{(1)},x^{(2)},\ldots,x^{(p)}].$

Without loss of generality, we may assume that each column of the $X$ matrix is normalized, such that the mean of $x^{(l)}$ is 0 and standard deviation is 1, for $l=1,\ldots,p$ .

Let $\Sigma$ be the sample covariance matrix of $X$ .

We then have

$\displaystyle\underset{p\times p}{\Sigma}=\frac{1}{n-1}X^{T}X=\frac{1}{n-1}\sum_{i=1}^{n}x_{i}x_{i}^{T}.$ (1)

Let $\mathcal{F}$ be a reproducing kernel Hilbert space on $\mathcal{X}$ from a kernel function $k(\cdot,\cdot)$ , which is a Mercer kernel (symmetric and positive-definite), and $\mathcal{G}$ be a reproducing kernel Hilbert space on $\mathcal{Y}$ from a kernel function $l(\cdot,\cdot)$ .

For the kernel $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ , its associated space $\mathcal{F}$ may be infinite-dimensional, but with some additional conditions, the minimizer of a regularized risk function lies in the finite span of the training observations [54]. Additionally, it has been shown [54] that there exists a function

$\displaystyle\phi:\mathcal{X}\rightarrow\mathcal{F}$ (2)

such that for all $x,x^{\prime}\in\mathcal{X}$ ,

$\displaystyle k(x,x^{\prime})=\langle\phi(x),\phi(x^{\prime})\rangle,$ (3)

where $\langle\cdot,\cdot\rangle$ is the inner product in $\mathcal{F}$ . Let $K$ be the $n\times n$ matrix whose $ij$ -th element is $k(x_{i},x_{j})$ . We then have

$\displaystyle K=\{k(x_{i},x_{j})\}_{ij}=\{\langle\phi(x_{i}),\phi(x_{j})\rangle\}_{ij}=\Phi(X)\Phi(X)^{T},$ (4)

where $\Phi(X)=[\phi(x_{1}),\phi(x_{2}),\ldots,\phi(x_{n})]^{T}$ . Here, the kernel matrix $K$ is the Gram matrix of the $\phi(x_{i})$ ’s.
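As a small numerical illustration of Eqs. (3) and (4), the sketch below uses the homogeneous quadratic kernel on $\mathbb{R}^{2}$, chosen only because its feature map is finite and can be written down explicitly. The kernel, the map `phi`, and the toy data are illustrative assumptions and are not used in the experiments of this paper.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous quadratic kernel on R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def quad_kernel(x, z):
    """k(x, z) = (x . z)^2, evaluated without using the feature map."""
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # toy data: n = 5 observations, p = 2

# Gram matrix from kernel evaluations only.
K = np.array([[quad_kernel(xi, xj) for xj in X] for xi in X])

# The same matrix from the explicit map, Phi(X) Phi(X)^T as in Eq. (4).
Phi = np.array([phi(xi) for xi in X])
K_explicit = Phi @ Phi.T

print(np.allclose(K, K_explicit))
```

For kernels such as the RBF kernel, no finite-dimensional `phi` is available, which is why only the left-hand construction of $K$ is used in practice.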

3.2 Principal component analysis and kernel principal component analysis

In standard PCA, we seek an orthogonal transformation matrix $A$ satisfying

$\displaystyle\underset{n\times d}{T}=\underset{n\times p}{X}\,\underset{p\times d}{A},$ (5)

where $T=[t_{1},t_{2},\ldots,t_{d}]$ for some $d\leqslant p$ , such that each column vector $t_{i}$ successively inherits maximal proportion of variance from the column vectors $x^{(l)}$ ’s, while ensuring the projection directions are perpendicular. The solutions can be expressed as the eigenvalue problem

$\displaystyle\Sigma a_{i}=\lambda_{i}a_{i},$ (6)

where $a_{i}$ is the $i$ -th column of $A$ , for $i=1,\ldots,d$ .

Following the work of [57], PCA can be extended to KPCA by first choosing a Mercer kernel $k$ , with which $x_{i}$ is transformed to $\phi(x_{i})$ . This maps the features in $X$ to $\Phi(X)$ . Assume that $\sum_{i=1}^{n}\phi(x_{i})$ is a vector with 0 in each entry. With the Gram matrix $K=\Phi(X)\Phi(X)^{T}$ as defined in Eq. (4) and through the kernel trick from Eq. (3), we have the eigenvalue problem:

$\displaystyle{K}a_{i}^{*}=\lambda_{i}^{*}a_{i}^{*},$ (7)

where $d$ is the desired dimension and $a_{1}^{*},\ldots,a_{d}^{*}$ are the eigenvectors of $K$ , with associated eigenvalues $\lambda_{1}^{*}\geqslant\lambda_{2}^{*}\geqslant\cdots\geqslant\lambda_{d}^{*}$ . Hence, the advantage of the kernel-based approach is to calculate the Gram matrix $K$ without an explicit expression for $\phi$ . Without the centralization assumption on $\phi$ , the $K$ matrix in Eq. (7) can be replaced by

$\displaystyle{K}^{*}=H_{n}KH_{n},$ (8)

where $H_{n}=I_{n}-\frac{1}{n}1_{n}$ , $I_{n}$ is an identity matrix with dimension $n\times n$ , and $1_{n}$ is a matrix of 1’s with dimension $n\times n$ .

We note that $H_{n}$ is idempotent, since it is a square matrix satisfying $H_{n}=H_{n}H_{n}$ . For any square matrix $S$ with dimension $n\times n$ , the average of each column of the matrix $H_{n}S$ is 0, as is the average of each row of the matrix $SH_{n}$ . Thus, the $K^{*}$ matrix is the “centralized” version of the original $K$ matrix.
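The centering step in Eq. (8) and the eigenvalue problem in Eq. (7) can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation used in our experiments: the function names `centering_matrix` and `kpca` are hypothetical, and the linear kernel is used here only because it allows a check against ordinary PCA.

```python
import numpy as np

def centering_matrix(n):
    """H_n = I_n - (1/n) 1_n, with 1_n the n x n matrix of ones."""
    return np.eye(n) - np.ones((n, n)) / n

def kpca(K, d):
    """Top-d eigenpairs of the centered Gram matrix K* = H_n K H_n."""
    n = K.shape[0]
    H = centering_matrix(n)
    K_star = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(K_star)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]       # indices of the d largest
    return eigvals[order], eigvecs[:, order], K_star

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T                                     # linear kernel, for checking only
lam, A, K_star = kpca(K, d=3)

# With a linear kernel, the nonzero eigenvalues of K* equal (n - 1) times
# the eigenvalues of the sample covariance matrix of X.
cov_eigs = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
print(np.allclose(lam, (X.shape[0] - 1) * cov_eigs))
```

Replacing `K` by any other Mercer kernel matrix leaves the rest of the sketch unchanged, which is the practical content of the kernel trick.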

3.3 Supervised kernel principal component analysis

PCA and KPCA are unsupervised methods, since they do not consider the response variable, only considering directions of maximum variability in the covariates. If the goal is classification, this may not be ideal, since the principal components may be unrelated to the class difference. SKPCA is a supervised generalization of KPCA, which aims to find the principal components with maximal dependence on the response variable. Drawing from [5, 65], we formulate SKPCA as follows.

In SKPCA, class information is incorporated by maximizing the Hilbert-Schmidt independence criterion (HSIC) [20]. With the aforementioned reproducing kernel Hilbert spaces $\mathcal{F}$ on $\mathcal{X}$ and $\mathcal{G}$ on $\mathcal{Y}$ and related kernel functions $k(\cdot,\cdot)$ and $l(\cdot,\cdot)$ respectively, the HSIC can be expressed as

$\displaystyle\textit{HSIC}(P_{\mathcal{X},\mathcal{Y}},\mathcal{F},\mathcal{G})=E_{x,x^{\prime},y,y^{\prime}}[k(x,x^{\prime})l(y,y^{\prime})]+E_{x,x^{\prime}}[k(x,x^{\prime})]E_{y,y^{\prime}}[l(y,y^{\prime})]-2E_{x,y}\big(E_{x^{\prime}}[k(x,x^{\prime})]E_{y^{\prime}}[l(y,y^{\prime})]\big),$ (9)

where $E_{x,x^{\prime},y,y^{\prime}}$ represents the expectation over independent pairs $(x,y)$ and $(x^{\prime},y^{\prime})$ (with respect to $P_{\mathcal{X},\mathcal{Y}}$ ), and $E_{x,x^{\prime}}$ and the similar terms are expectations with respect to the corresponding marginal distributions of $P_{\mathcal{X},\mathcal{Y}}$ .

With the results from [20], an empirical estimator of (9) is

$\displaystyle\textit{HSIC}(X,Y,\mathcal{F},\mathcal{G})=\frac{1}{(n-1)^{2}}\textrm{tr}(KH_{n}LH_{n}),$ (10)

where $K$ and $H_{n}$ are defined as before for KPCA and $L=\{1(y_{i}=y_{j})\}_{ij}$ is a link matrix with dimension $n\times n$ , where $1(\cdot)$ is an indicator function with value 1 if the event is true and 0 otherwise.

Similarly as for KPCA, $K$ and $L$ can be adjusted to satisfy the centralization assumption. As discussed previously, $H_{n}$ is an idempotent matrix. Therefore, following from Eq. (10),

$\displaystyle\textit{HSIC}^{*}(X,Y,\mathcal{F},\mathcal{G})=\frac{1}{(n-1)^{2}}\textrm{tr}(KH_{n}H_{n}LH_{n}H_{n})=\frac{1}{(n-1)^{2}}\textrm{tr}(H_{n}KH_{n}H_{n}LH_{n})=\frac{1}{(n-1)^{2}}\textrm{tr}(K^{*}L^{*}),$ (11)

where $K^{*}$ and $L^{*}$ are the “centralized” versions of the $K$ and $L$ matrices respectively.
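The empirical estimator in Eq. (10) and the identity in Eq. (11) can be checked numerically. The sketch below is illustrative only: the helper names `link_matrix`, `hsic`, and `linear_gram` are hypothetical, the data are simulated, and a linear kernel on the features is assumed purely for simplicity.

```python
import numpy as np

def link_matrix(y):
    """L with entries 1(y_i = y_j)."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

def hsic(K, L):
    """Empirical HSIC, tr(K H_n L H_n) / (n - 1)^2, as in Eq. (10)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def linear_gram(X):
    """Gram matrix of the linear kernel (an assumption made for this toy check)."""
    return X @ X.T

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 15)
X_dep = rng.normal(size=(30, 2)) + 3.0 * y[:, None]   # features shifted by class
X_ind = rng.normal(size=(30, 2))                      # features unrelated to class

L = link_matrix(y)
hsic_dep = hsic(linear_gram(X_dep), L)
hsic_ind = hsic(linear_gram(X_ind), L)
print(hsic_dep, hsic_ind)

# Eq. (11): the same value is obtained from the centered matrices K* and L*.
n = len(y)
H = np.eye(n) - np.ones((n, n)) / n
K = linear_gram(X_dep)
hsic_star = np.trace((H @ K @ H) @ (H @ L @ H)) / (n - 1) ** 2
print(np.isclose(hsic_dep, hsic_star))
```

The features that depend on the class label yield the larger criterion value, which is the behavior SKPCA exploits when choosing projection directions.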

On another note, in the binary gender classification problem, $\textrm{rank}(L)=2$ and $\textrm{rank}(KH_{n}LH_{n}K)\leqslant 2$ [65]. Therefore, we modify the link matrix according to [65] by

$\displaystyle L=\{1(y_{i}=y_{j})\times k(x_{i},x_{j})\}_{ij}.$ (12)

It can be shown that maximization of Eq. (10) is equivalent to solving the generalized eigenvalue problem

$\displaystyle Av_{i}=\lambda_{i}Kv_{i},$ (13)

where $A=KH_{n}LH_{n}K$ and each $v_{i}$ is an eigenvector with associated eigenvalue $\lambda_{i}$ for $i=1,\ldots,d$ , where $d$ is the desired dimension [65]. The main advantage of the link matrix in Eq. (12) now becomes apparent: the rank of $A$ may increase beyond 2, permitting more than two nonzero eigenvalues to be computed.
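The generalized eigenvalue problem in Eq. (13), with the modified link matrix of Eq. (12), can be sketched as follows. This is a sketch under stated assumptions rather than the implementation used in our experiments: the small ridge term added to $K$ (so that a Cholesky factor exists), the function names, the toy data, and the use of $KV$ as training scores are all assumptions introduced for illustration.

```python
import numpy as np

def rbf_gram(X, delta):
    """Gram matrix of the RBF kernel exp(-delta * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-delta * sq)

def skpca_directions(K, y, d, ridge=1e-6):
    """Top-d solutions of A v = lambda (K + ridge I) v, with A = K H L H K."""
    n = K.shape[0]
    y = np.asarray(y)
    H = np.eye(n) - np.ones((n, n)) / n
    L = (y[:, None] == y[None, :]) * K              # modified link matrix, Eq. (12)
    A = K @ H @ L @ H @ K
    C = np.linalg.cholesky(K + ridge * np.eye(n))   # K + ridge I = C C^T (assumed ridge)
    # Whitening gives a symmetric problem: B w = lambda w with
    # B = C^{-1} A C^{-T}; the generalized eigenvector is v = C^{-T} w.
    B = np.linalg.solve(C, np.linalg.solve(C, A).T).T
    lam, W = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(lam)[::-1][:d]
    V = np.linalg.solve(C.T, W[:, order])
    return lam[order], V

rng = np.random.default_rng(2)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 2)) + 2.0 * y[:, None]     # simulated two-class data
K = rbf_gram(X, delta=2.0)
lam, V = skpca_directions(K, y, d=3)
Z = K @ V     # training scores under the convention assumed here
print(lam)
```

The whitening route is only one way to solve Eq. (13); a dedicated generalized symmetric eigensolver would serve equally well.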

3.4 Linear discriminant analysis and kernel linear discriminant analysis

Given a dataset with finite classes, LDA aims to find the best set of features to discriminate among the classes. We first review standard LDA, then generalize to KLDA. We note that sometimes parametric assumptions for LDA are made, such as that observations from each class are normally distributed with common covariance. Here, we make no such assumptions.

Suppose that each observation $x_{i}$ for $i=1,\ldots,n$ belongs to exactly one of $C$ classes. Define the following feature vectors: $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ as the overall mean and $\bar{x_{c}}=\frac{1}{n_{c}}\sum_{i=1}^{n}x_{i}1(x_{i}\in\text{class c})$ as the mean of the $c$ -th class with $n_{c}$ the size of the $c$ -th class in the sample, for $c=1,\ldots,C$ .

In standard LDA, we seek to maximize the objective function

$\displaystyle J(v)=\frac{v^{T}S_{B}v}{v^{T}S_{W}v},$ (14)

where $v$ is a $p\times 1$ vector, $S_{B}$ is the between-class scatter matrix, and $S_{W}$ is the within-class scatter matrix defined by

$\displaystyle\underset{p\times p}{S_{B}}=\sum_{c}n_{c}(\bar{x_{c}}-\bar{x})(\bar{x_{c}}-\bar{x})^{T}\quad\textrm{and}$

$\displaystyle\underset{p\times p}{S_{W}}=\sum_{c}\sum_{i\in c}(x_{i}-\bar{x_{c}})(x_{i}-\bar{x_{c}})^{T}.$ (15)

Hence, maximizing $J(v)$ involves finding some rotation of the scatter matrices such that the “distance” between the groups is maximized relative to the variations within each group.

Maximization of $J(v)$ in Eq. (14) is equivalent to solving the generalized eigenvalue problem

$\displaystyle S_{B}v_{i}=\lambda_{i}{S_{W}}v_{i},$ (16)

where each $v_{i}$ is an eigenvector with corresponding eigenvalue $\lambda_{i}$ , for $i=1,\ldots,d$ , where $d$ is the desired dimension.
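The scatter matrices in Eq. (15) and the generalized eigenvalue problem in Eq. (16) can be sketched as follows. The function names, the simulated three-class data, and the small ridge added to $S_{W}$ for numerical stability are assumptions made for this illustration. The sketch also shows that at most $C-1$ eigenvalues are nonzero, because $S_{B}$ has rank at most $C-1$.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (S_B) and within-class (S_W) scatter matrices, Eq. (15)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    p = X.shape[1]
    x_bar = X.mean(axis=0)
    S_B, S_W = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        X_c = X[y == c]
        diff = (X_c.mean(axis=0) - x_bar)[:, None]
        S_B += X_c.shape[0] * (diff @ diff.T)
        centered = X_c - X_c.mean(axis=0)
        S_W += centered.T @ centered
    return S_B, S_W

def lda_directions(X, y, d, ridge=1e-8):
    """Top-d solutions of S_B v = lambda (S_W + ridge I) v via whitening."""
    S_B, S_W = scatter_matrices(X, y)
    C = np.linalg.cholesky(S_W + ridge * np.eye(S_W.shape[0]))
    B = np.linalg.solve(C, np.linalg.solve(C, S_B).T).T   # C^{-1} S_B C^{-T}
    lam, W = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(lam)[::-1][:d]
    return lam[order], np.linalg.solve(C.T, W[:, order])

rng = np.random.default_rng(3)
y = np.repeat([0, 1, 2], 30)                               # C = 3 classes
means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
X = rng.normal(size=(90, 4)) + means[y]                    # p = 4 features

S_B, S_W = scatter_matrices(X, y)
lam, V = lda_directions(X, y, d=3)
print(np.linalg.matrix_rank(S_B, tol=1e-8), lam)           # third eigenvalue is numerically zero
```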

LDA is generalized to KLDA using the kernel representation from Eq. (3). Analogously to LDA above, we seek a solution $v^{*}$ that will result in the maximization of the objective function

$\displaystyle J(v)=\frac{v^{T}S_{B}^{*}v}{v^{T}S_{W}^{*}v},$ (17)

where now $v\in\mathcal{F}$ and $S_{B}^{*}$ and $S_{W}^{*}$ are the between-class and within-class scatter matrices in $\mathcal{F}$ defined by

$\displaystyle m^{\phi}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_{i}),$

$\displaystyle m_{c}^{\phi}=\frac{1}{n_{c}}\sum_{i=1}^{n}\phi(x_{i})1(x_{i}\in\text{class c}),$

$\displaystyle S_{B}^{*}=\sum_{c}n_{c}(m_{c}^{\phi}-m^{\phi})(m_{c}^{\phi}-m^{\phi})^{T},\textrm{ and}$

$\displaystyle S_{W}^{*}=\sum_{c}\sum_{i\in c}(\phi(x_{i})-m_{c}^{\phi})(\phi(x_{i})-m_{c}^{\phi})^{T}.$ (18)

The above expressions involve knowledge of $\phi$ , which is often not available. It can be shown that Eq. (17) is equivalent to

$\displaystyle J(u)=\frac{u^{T}Mu}{u^{T}Nu},$ (19)

where

$\displaystyle M_{c}=(M_{cj})_{j}=\left(\frac{1}{n_{c}}\sum_{h=1}^{n}k(x_{j},x_{h})1(x_{h}\in\text{class c})\right)_{j},$

$\displaystyle\bar{M}=(\bar{M}_{j})_{j}=\left(\frac{1}{n}\sum_{h=1}^{n}k(x_{j},x_{h})\right)_{j},$

$\displaystyle M=\sum_{c}n_{c}(M_{c}-\bar{M})(M_{c}-\bar{M})^{T},$

$\displaystyle K_{c}=K\times\textit{outer}(X,X_{c}),$

$\displaystyle N=\sum_{c}K_{c}H_{n_{c}}K_{c}^{T},$ (20)

$X_{c}$ is a matrix of dimension $n_{c}\times p$ with rows being features from the $c$ -th class, and $outer(X,X_{c})$ is an $n\times n_{c}$ matrix with its $ij$ -th element equal to $1(x_{i}\text{ is the }j\text{-th observation in class c})$ . A full discussion of KLDA can be found in [42].

Maximization of $J(u)$ in Eq. (19) is equivalent to solving the generalized eigenvalue problem

$\displaystyle Mu_{i}=\lambda_{i}Nu_{i},$ (21)

where each $u_{i}$ is an eigenvector with associated eigenvalue $\lambda_{i}$ , for $i=1,\ldots,d$ with $d$ as the desired dimension.

Comparing the generalized eigenvalue problems in Eqs (16) and (21), the structures of matrices $S_{B}$ and $M$ are similar, since both “measure” the variation between different classes.

Let $W_{c}=[w_{c,1},\ldots,w_{c,n_{c}}]=K_{c}H_{n_{c}}$ , a matrix of dimension $n\times n_{c}$ . Due to the centering effect of $H_{n_{c}}$ , every row of $W_{c}$ sums to zero. Moreover, $K_{c}H_{n_{c}}(K_{c}H_{n_{c}})^{T}=W_{c}W_{c}^{T}=\sum_{i=1}^{n_{c}}w_{c,i}w_{c,i}^{T}$ . For the matrix $N$ , since $H_{n_{c}}$ is symmetric and idempotent,

$\displaystyle N=\sum_{c}K_{c}H_{n_{c}}H_{n_{c}}K_{c}^{T}=\sum_{c}\sum_{i=1}^{n_{c}}w_{c,i}w_{c,i}^{T}.$ (22)

Thus, the matrix $N$ has an identical structure to the $S_{W}$ and $S_{W}^{*}$ matrices, which “measure” the overall variation within groups.
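The matrices $M$ and $N$ in Eq. (20) depend on the data only through the Gram matrix, as the following sketch illustrates. The function names, the simulated data, and the ridge term added to $N$ before solving Eq. (21) are assumptions made for illustration. With a linear kernel, $M$ and $N$ reduce to $XS_{B}X^{T}$ and $XS_{W}X^{T}$, which gives a direct check of the structural correspondence with LDA.

```python
import numpy as np

def klda_matrices(K, y):
    """M and N of Eq. (20), built from the Gram matrix K and class labels y."""
    y = np.asarray(y)
    n = K.shape[0]
    M_bar = K.mean(axis=1)                        # (1/n) sum_h k(x_j, x_h)
    M, N = np.zeros((n, n)), np.zeros((n, n))
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        n_c = idx.size
        K_c = K[:, idx]                           # n x n_c, i.e. K times outer(X, X_c)
        diff = (K_c.mean(axis=1) - M_bar)[:, None]
        M += n_c * (diff @ diff.T)
        H_c = np.eye(n_c) - np.ones((n_c, n_c)) / n_c
        N += K_c @ H_c @ K_c.T
    return M, N

def klda_directions(K, y, d, ridge=1e-6):
    """Top-d solutions of M u = lambda (N + ridge I) u (ridge is an assumption)."""
    M, N = klda_matrices(K, y)
    C = np.linalg.cholesky(N + ridge * np.eye(N.shape[0]))
    B = np.linalg.solve(C, np.linalg.solve(C, M).T).T
    lam, W = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(lam)[::-1][:d]
    return lam[order], np.linalg.solve(C.T, W[:, order])

rng = np.random.default_rng(4)
y = np.repeat([0, 1, 2], 10)
X = rng.normal(size=(30, 3)) + 2.0 * np.eye(3)[y]

# Check against LDA with a linear kernel: M = X S_B X^T and N = X S_W X^T.
M_lin, N_lin = klda_matrices(X @ X.T, y)
x_bar = X.mean(axis=0)
S_B = sum((y == c).sum() * np.outer(X[y == c].mean(axis=0) - x_bar,
                                    X[y == c].mean(axis=0) - x_bar)
          for c in np.unique(y))
S_W = (X - x_bar).T @ (X - x_bar) - S_B           # total scatter minus S_B
match_M = np.allclose(M_lin, X @ S_B @ X.T)
match_N = np.allclose(N_lin, X @ S_W @ X.T)
print(match_M, match_N)

# The same routine with an RBF Gram matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
lam, U = klda_directions(np.exp(-0.5 * sq), y, d=2)
```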

4. Visualization on simulation studies

To visualize and improve understanding of the manifold learning methods KPCA, SKPCA, and KLDA, we apply them in three simulation studies. For comparison, the linear techniques PCA and LDA are also considered. Each dataset contains nonlinear patterns, and the goal is to transform the data to be linearly separable. For this reason, the radial basis function (RBF)

$\displaystyle k(x_{i},x_{j})=e^{-\delta{||x_{i}-x_{j}||}_{2}^{2}}$ (23)

is chosen as a kernel for each pair of observed vectors $x_{i},x_{j}$ . For each DR method, a range of values for the tuning parameter $\delta$ is tested, and the value that best separates the classes visually is selected. A full discussion of the RBF kernel, among others, can be found in [61]. Figures 1–3 compare the original data to 2-dimensional projections from each DR method. In each plot, color corresponds to the true class to which an observation belongs.
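For reference, the RBF Gram matrix of Eq. (23) can be computed in vectorized form as sketched below. The function name `rbf_gram` and the toy data are hypothetical; the loop simply illustrates how $\delta$ controls the similarity between distinct points.

```python
import numpy as np

def rbf_gram(X, Z=None, delta=1.0):
    """Matrix of k(x_i, z_j) = exp(-delta * ||x_i - z_j||_2^2), as in Eq. (23)."""
    X = np.asarray(X, dtype=float)
    Z = X if Z is None else np.asarray(Z, dtype=float)
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z, computed for all pairs at once.
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    np.maximum(sq, 0.0, out=sq)          # guard against tiny negative round-off
    return np.exp(-delta * sq)

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 3))
off_diagonal = ~np.eye(6, dtype=bool)
for delta in (0.1, 1.0, 10.0):
    K = rbf_gram(X, delta=delta)
    # Larger delta shrinks the off-diagonal similarities toward 0.
    print(delta, K[off_diagonal].mean())
```

Passing a second matrix `Z` gives the kernel values between training and new observations, which is what is needed to project test data.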

Figure 1.

Wine chocolate simulation study.

In the first simulation study, the original data are plotted in 3D in Fig. 1a; the green sphere is embedded within the magenta group, necessitating nonlinear manifold learning. The KLDA projections in Fig. 1b are linearly separable with very good variation between the classes and a fair amount of variation within the classes. KPCA and SKPCA projections in Fig. 1c and d are at least approximately linearly separable, although it is not clear whether a linear boundary perfectly separates the two classes. In Fig. 1e, PCA fails to linearly separate the groups, rotating the wine chocolate in 2D. The maximum dimension LDA can retain is $C-1$ , where $C$ is the number of classes; with 2 classes, the projections must be plotted on a 1D number line, given in Fig. 1f. Points from the two classes overlap considerably in Fig. 1e and f.

Figure 2.

Apple tart simulation study.

In the second simulation study, the original data in Fig. 2a follow a nonlinear pattern. In Fig. 2b, KLDA produces groups which are linearly separable. The KPCA projections are approximately linearly separable in Fig. 2c; however, there is some overlap between groups, especially the green and pink groups in the third quadrant. In Fig. 2d, SKPCA produces almost linearly separable groups. In Fig. 2e and f, PCA and LDA simply rotate the original data in 2D space, as expected.

Figure 3.

Swiss roll simulation study.

For the third simulation study, the original data in Fig. 3a are in 3D and follow a swirling, nonlinear pattern. In Fig. 3b, KLDA yields favorable results; the groups are well-separated linearly. KPCA and SKPCA in Fig. 3c and d also produce good results, although in Fig. 3c more separation between the purple and bright green groups would be ideal. In Fig. 3e and Fig. 3f, respectively, PCA and LDA merely rotate the original data projected in 2D space; there is no linear separation between the magenta and purple groups, nor between the two green groups.

In all three simulation studies, KLDA, KPCA, and SKPCA transform the data into groups that are at least approximately linearly separable. In general, KLDA and SKPCA perform the best here. Their success over KPCA is expected, since KLDA and SKPCA are supervised techniques. On the other hand, results indicate that KPCA and SKPCA are more sensitive than KLDA to different choices of tuning parameters. Hence, SKPCA and KPCA may perform better for alternative choices of parameters. In all our studies, the nonlinear techniques outperform linear PCA and LDA. These preliminary studies suggest the radial kernel is appropriate for our face analysis experiments.

5. Kernel-based dimension reduction optimization and classification on Morph-II

Motivated by the nonlinear nature of facial demographic analysis, we propose and implement a novel machine learning process for the Morph-II dataset. We consider the kernel-based DR methods KPCA, SKPCA, and KLDA on three types of appearance-based extracted features (LBP, BIF, and HOG) for the gender classification task. We illustrate parameter optimization and compare the performance of these methods on Morph-II.

5.1 Longitudinal face database

MORPH [52] is one of the most popular face databases available to the public, especially for age estimation, race classification, and gender classification. Multiple versions of MORPH have been released, and the version adopted in this work is the 2008 MORPH-II non-commercial release (referred to as Morph-II in this paper). Morph-II includes over 55,000 mugshots with longitudinal spans and metadata such as date of birth, race, gender, and age.

In addition to its size, Morph-II presents challenges because of disproportionate race and gender ratios. About 84.6% of images are of males, while only about 15.4% of images are of females. Imbalanced classes are known to negatively affect certain classification algorithms [30]. Moreover, Morph-II is skewed in terms of race, with approximately 77.2% of images picturing black subjects. Guo et al. found that age, gender, and race interact for demographic analysis tasks including gender classification, race classification, and age prediction [21, 23, 22], so both race and gender imbalance in Morph-II can hamper gender classification.

5.2 Subsetting scheme

To overcome the uneven race and gender distributions in Morph-II, Guo et al. proposed a subsetting scheme [22]. Since then, many studies on Morph-II have adopted such an evaluation protocol. Based on discussions in Guo et al. [22], a new automatic subsetting scheme is proposed in [80], aiming to automatically ensure independent training and testing sets. Additionally, inconsistencies in age, gender, and race in Morph-II have been identified and corrected in [80]. After following the steps to clean MORPH-II outlined in [80], we apply the automatic subsetting scheme, summarized in Fig. 4 and described below.

Figure 4.

Flowchart representing our subsetting scheme [80] for MORPH-II, which improves the one from [22].

Let $W$ be the Whole Morph-II dataset, $S$ the selected training/testing set, and $R$ the remaining set. We further divide $S$ into even subsets $S_{1}$ and $S_{2}$ . Separately within each subset $S_{1}$ and $S_{2}$ , we fix the ratios of white (W) to black (B) images as 1:1 and male (M) to female (F) images as 3:1. Further, $S_{1}$ and $S_{2}$ have been selected such that the age distributions within each set are similar (details shown in [80]). The gender and race summaries for the subsetting scheme are shown in Table 1. Most importantly, the sets $R,S_{1},$ and $S_{2}$ are independent; no sets share images from the same subject. We use $S$ as an alternating training and testing set. First, we train on $S_{1}$ and test on $S_{2}\cup R$ , then we train on $S_{2}$ and test on $S_{1}\cup R$ . The final classification accuracy is obtained by averaging the classification accuracies from the alternations.
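The alternating evaluation protocol described above can be sketched as follows. The function `alternating_accuracy` and its arguments are hypothetical names, the nearest-class-mean classifier is only a placeholder for the DR-plus-SVM pipeline, and the data are simulated rather than drawn from Morph-II.

```python
import numpy as np

def alternating_accuracy(fit, predict, S1, S2, R):
    """Train on S1 and test on S2 u R, then swap S1 and S2; return the mean accuracy.

    S1, S2, and R are (features, labels) pairs; fit and predict are callables."""
    accuracies = []
    for train, held_out in ((S1, S2), (S2, S1)):
        model = fit(train[0], train[1])
        X_test = np.vstack([held_out[0], R[0]])
        y_test = np.concatenate([held_out[1], R[1]])
        accuracies.append(np.mean(predict(model, X_test) == y_test))
    return float(np.mean(accuracies))

# Placeholder classifier (an assumption, not the classifier used in the paper).
def fit_nearest_mean(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_nearest_mean(model, X):
    classes, means = model
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def toy_subset(rng, n_per_class):
    """Two well-separated simulated classes standing in for a data subset."""
    y = np.repeat([0, 1], n_per_class)
    X = rng.uniform(-1.0, 1.0, size=(y.size, 2)) + 5.0 * y[:, None]
    return X, y

rng = np.random.default_rng(6)
S1, S2, R = toy_subset(rng, 20), toy_subset(rng, 20), toy_subset(rng, 50)
acc = alternating_accuracy(fit_nearest_mean, predict_nearest_mean, S1, S2, R)
print(acc)
```

Because the remaining set is included in both test rounds, its images contribute to both accuracies before averaging.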

Table 1

Number of images in subsets by race and gender

	WF	BF	WM	BM	dF	dM	Overall	F	M
S1	1,285	1,285	3,855	3,855	0	0	10,280	2,570	7,710
S2	1,285	1,285	3,855	3,855	0	0	10,280	2,570	7,710
R	0	3,150	220	28,980	144	1,850	34,344	3,294	31,050
Overall	2,570	5,720	7,930	36,690	144	1,850	54,904	8,434	46,470

Race-gender combinations are abbreviated, e.g., BF represents the black female group. Abbreviations dF and dM represent those who are neither black nor white in race.

5.3 Facial feature extraction

In computer vision, image preprocessing is often an essential first step to reduce unnecessary variation, decrease pixel dimension, and simplify pixel encoding. Despite the standard format of police photography in mugshots, Morph-II photographs vary in head-tilt, camera distance, occlusion, and illumination. We address this variation as follows. Images are first converted to grayscale. Next, faces are automatically detected, eliminating the background and hair, so that no external cues can be used to classify gender. The resulting images are centered and scaled with respect to the center of the irises. Finally, the images are cropped to be 70 pixels tall by 60 pixels wide. Full methodological details are given in [81] and align with standard preprocessing protocols from face analysis.

After preprocessing, pixel-related features are extracted from the face images in Morph-II. As discussed previously, there are numerous approaches for feature extraction. In this study on Morph-II, we incorporate domain expertise by choosing three well-established appearance-based models from image analysis: local texture techniques such as local binary patterns (LBP) [76, 37, 40, 1], biologically-inspired features (BIF) [21, 26], and histogram of oriented gradients (HOG) [21]. Additionally, these model-based approaches provide “robust interpretation $\ldots$ by constraining solutions to be face-like” [14]. Detailed documentation of our feature extraction process can be found in [34, 81].

Table 2
Parameter summary

Features	LBP	$s=$ 10, 12, 14, 16, 18, 20
		$r=$ 1, 2, 3
	HOG	$s=$ 4, 6, 8, 10, 12, 14
		$o=$ 4, 6, 8
	BIF	$s=$ 7–37,15–29
		$\gamma=$ 0.1, 0.2, $\ldots$ ,1.0
Dimension reduction	KPCA	$\delta=$ $\pm$ 0.1, $\pm$ 1, $\pm$ 5, $\pm$ 10, $\pm$ 100
	SKPCA	$\delta=$ $-$ 0.0001, $-$ 0.001
		$\eta=$ 0.001, 0.01, 0.1, 1
	KLDA	$\delta=$ $\pm$ 0.01, $\pm$ 0.1, $\pm$ 1, $\pm$ 5, $\pm$ 10, $\pm$ 100
Classifier	Linear SVM	$c=10^{-8},\ldots,10^{-1},1,10,\ldots,10^{8}$

5.4 Kernel-based dimension reduction optimization

Tuning parameter selection is essential for kernel-based methods in order to achieve good results. Within the framework of feature extraction, dimension reduction, and the classification model, there are many combinations of parameters to be considered. The main parameters and tested values are summarized in Table 2 and discussed as follows. LBP features have two main parameters: block size $s$ and window radius $r$ . For HOG, the two main parameters are block size $s$ and number of orientations $o$ . For BIF, we consider the block size $s$ and the parameter $\gamma$ , which represents the spatial aspect ratio; there is also a choice of pooling operation, which we select here as the standard deviation operation.

For each dimension reduction method, the radial kernel

$\displaystyle k(x_{i},x_{j})=e^{\delta{||x_{i}-x_{j}||}_{2}^{2}}$ (24)

is used for each pair of observation vectors $x_{i},x_{j}$ , based on the results from our simulation studies. In the kernel, we must select the tuning parameter $\delta$ , which scales the extent of similarity between pairs of vectors. This parameter must be chosen with particular care, since a poor choice can result in transformed features with little to no variability. Empirically, we observed that SKPCA was more sensitive than KLDA and KPCA to the choice of $\delta$ ; values of $\delta$ at or above 1 resulted in a rank deficient matrix and failure to compute all requested eigenvalues. For SKPCA, we consider an additional scaling parameter $\eta$ in the modified link function proposed by Wang et al. [65]:

$\displaystyle l(y_{i},y_{j})=e^{\eta\delta{||x_{i}-x_{j}||}_{2}^{2}},$ (25)

for all observed responses $y_{i},y_{j}$ in the same class. The scale parameter $\eta$ enables the weighting of dependence between the covariates and response.
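
To make Eqs. (24) and (25) concrete, the following Python sketch evaluates both on toy vectors. It is a minimal illustration, not the authors' R implementation: the function names and the toy data are assumptions, and the link function is evaluated only for a pair already known to share a class, since that is the only case Eq. (25) specifies.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance ||x - y||_2^2 between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def radial_kernel(x_i, x_j, delta):
    """Eq. (24): k(x_i, x_j) = exp(delta * ||x_i - x_j||^2)."""
    return math.exp(delta * sq_dist(x_i, x_j))


def same_class_link(x_i, x_j, delta, eta):
    """Eq. (25), for a pair whose responses y_i and y_j share a class."""
    return math.exp(eta * delta * sq_dist(x_i, x_j))


def kernel_matrix(X, delta):
    """Symmetric n x n matrix of Eq. (24) evaluated on every pair of rows."""
    return [[radial_kernel(a, b, delta) for b in X] for a in X]


# Toy vectors (hypothetical, not Morph-II features).
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
K = kernel_matrix(X, delta=-0.1)
for row in K:
    print([round(v, 4) for v in row])
```

With a negative $\delta$ of large magnitude, the off-diagonal entries shrink toward zero, while $\delta$ near zero pushes every entry toward one; in either extreme the kernel matrix carries little information, which is one way to see why $\delta$ needs careful tuning.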

Finally, we choose a linear SVM to classify gender based on the dimension-reduced, transformed features. The motivation for this classifier is discussed in the next section. The main parameter for linear SVM is the cost $c$, which controls how heavily misclassification in training is penalized: larger values of $c$ tolerate fewer training errors. We consider values of $c$ from $10^{-8}$ to $10^{8}$.

Table 3

Tuning results on a subset of MORPH-II

Method	Feature	Parameters	Accuracy
KPCA	BIF $s=$ 7–37, $\gamma=$ 0.1	$\delta=$ $-$ 1, $c=$ 10	0.882
	BIF $s=$ 7–37, $\gamma=$ 0.6	$\delta=$ $-$ 1, $c=$ 10	0.882
	BIF $s=$ 15–29, $\gamma=$ 0.1	$\delta=$ $-$ 1, $c=$ 100	0.882
	BIF $s=$ 15–29, $\gamma=$ 0.6	$\delta=$ $-$ 1, $c=$ 10	0.882
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 100, $c=$ 0.1	0.917
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 5, $c=$ 0.001	0.919
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 1, $c=$ 0.001	0.917
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 0.1, $c=$ 0.1	0.917
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 100, $c=$ 0.1	0.912
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 5, $c=$ 0.1	0.912
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 1, $c=$ 0.001	0.912
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 0.1, $c=$ 0.1	0.912
SKPCA	BIF $s=$ 7–37, $\gamma=$ 0.2	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.899
	BIF $s=$ 7–37, $\gamma=$ 0.8	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.899
	BIF $s=$ 15–29, $\gamma=$ 0.5	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.899
	HOG $s=$ 6, $o=$ 6	$\delta=$ 0.0001, $\eta=$ 0.001, $c=$ 1	0.931
	HOG $s=$ 6, $o=$ 6	$\delta=$ 0.0001, $\eta=$ 0.01, $c=$ 0.001	0.931
	HOG $s=$ 6, $o=$ 6	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 0.001	0.931
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 0.001, $c=$ 1	0.937
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 0.01, $c=$ 1	0.937
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.938
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 1, $c=$ 1	0.939
KLDA	BIF $s=$ 7–37, $\gamma=$ 0.3	$\delta=$ $-$ 1, $c=$ 10	0.875
	BIF $s=$ 7–37, $\gamma=$ 0.6	$\delta=$ $-$ 1, $c=$ 100	0.875
	BIF $s=$ 15–29, $\gamma=$ 0.2	$\delta=$ $-$ 1, $c=$ 10	0.875
	BIF $s=$ 15–29, $\gamma=$ 0.8	$\delta=$ $-$ 1, $c=$ 100	0.875
	HOG $s=$ 4, $o=$ 4	$\delta=$ 1, $c=$ 1	0.917
	HOG $s=$ 4, $o=$ 6	$\delta=$ 1, $c=$ 1	0.917
	HOG $s=$ 12, $o=$ 8	$\delta=$ $-$ 1, $c=$ 100	0.904
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 0.1, $c=$ 1	0.906
	LBP $s=$ 10, $r=$ 1	$\delta=$ 1, $c=$ 1	0.908
	LBP $s=$ 14, $r=$ 1	$\delta=$ 0.1, $c=$ 10	0.898

We tune on small subsets of Morph-II to reduce runtime, memory consumption, and risk of over-fitting. 1000 images from $S_{1}$ and 1000 images from $S_{2}$ are randomly selected. The standard method of grid search is followed for tuning on these subsets. For each combination of parameters, a model is trained on the subset from $S_{1}$ and then tested on the subset from $S_{2}$ . For each dimension reduction method paired with each feature type (BIF, HOG, and LBP), the best three or four accuracy rates from tuning are obtained. (Except in the case of ties, we choose only the top three accuracy rates.) The tuning results for these top-performing parameters are given in Table 3. The parameters corresponding to these maximum accuracy rates are applied to the full dataset through the previously discussed evaluation protocol. Although this protocol involves testing on images from $S_{1}$ and $S_{2}$ , any overlap of images is minor (in each testing set, less than 2.3% of images have been used in tuning) and has little impact on the reported accuracy (discussed in Section 6). Regardless, the tuning parameters are selected through the same procedure, so the classification performances can be fairly compared among all considered DR methods.
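
The tuning procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' R code: `grid_search` and `toy_evaluate` are hypothetical names, the scores returned by `toy_evaluate` are placeholders (a real run would train on the 1000-image subset of $S_{1}$ and test on the subset of $S_{2}$), and the tie rule reflects our reading that the top three results are kept and extended only when further combinations tie with the third.

```python
from itertools import product


def grid_search(grid, evaluate, top=3):
    """Score every parameter combination; keep the `top` best results,
    plus any further combinations that tie with the last one kept."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    results.sort(key=lambda item: item[0], reverse=True)
    if len(results) <= top:
        return results
    cutoff = results[top - 1][0]
    return [item for item in results if item[0] >= cutoff]


# KPCA + linear SVM grid from Table 2: delta = +/-0.1, ..., +/-100 and
# c = 10^-8, ..., 10^8.
kpca_grid = {
    "delta": [sign * v for v in (0.1, 1, 5, 10, 100) for sign in (-1, 1)],
    "c": [10.0 ** k for k in range(-8, 9)],
}


def toy_evaluate(params):
    """Hypothetical placeholder score, NOT a real tuning accuracy."""
    delta_bonus = {-1: 0.010, -5: 0.008}.get(params["delta"], 0.0)
    c_bonus = 0.005 if params["c"] in (1.0, 10.0) else 0.0
    return round(0.9 + delta_bonus + c_bonus, 3)


best = grid_search(kpca_grid, toy_evaluate)
for score, params in best:
    print(score, params)
```

In this toy run the third and fourth combinations tie, so four results are returned, mirroring the "three or four" rows reported per method and feature in Table 3.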

Table 4

Gender classification results on MORPH-II

Method	Feature	Parameters	Acc ${}^{(1)}$	TPR ${}^{(2)}$	TNR ${}^{(3)}$	Mem ${}^{(4)}$	Time ${}^{(5)}$
KPCA	BIF $s=$ 7–37, $\gamma=$ 0.1	$\delta=$ $-$ 1, $c=$ 10	0.9296	0.9473	0.8127	34.04	42.26
	BIF $s=$ 7–37, $\gamma=$ 0.6	$\delta=$ $-$ 1, $c=$ 10	0.9297	0.9455	0.8112	34.68	36.94
	BIF $s=$ 15–29, $\gamma=$ 0.1	$\delta=$ $-$ 1, $c=$ 100	0.9071	0.9377	0.7050	31.74	33.83
	BIF $s=$ 15–29, $\gamma=$ 0.6	$\delta=$ $-$ 1, $c=$ 10	0.9096	0.9374	0.7266	31.80	35.97
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 100, $c=$ 0.1	0.9391	0.9726	0.7172	34.00	31.54
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 5, $c=$ 0.001	0.9391	0.9727	0.7170	34.00	30.86
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 1, $c=$ 0.001	0.9391	0.9724	0.7192	34.00	32.17
	HOG $s=$ 4, $o=$ 4	$\delta=$ $-$ 0.1, $c=$ 0.1	0.9364	0.9626	0.7634	34.35	31.41
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 100, $c=$ 0.1	0.9391	0.9726	0.7172	34.00	31.54
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 5, $c=$ 0.1	0.9391	0.9726	0.7172	34.00	30.86
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 1, $c=$ 0.001	0.9391	0.9724	0.7192	34.00	32.17
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 0.1, $c=$ 0.1	0.9364	0.9626	0.7634	34.35	31.41
SKPCA	BIF $s=$ 7–37, $\gamma=$ 0.2	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.9507	0.9616	0.8781	35.48	42.04
	BIF $s=$ 7–37, $\gamma=$ 0.8	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.9532	0.9639	0.8823	33.04	38.34
	BIF $s=$ 15–29, $\gamma=$ 0.5	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.9260	0.9477	0.7827	20.03	34.58
	HOG $s=$ 6, $o=$ 6	$\delta=$ 0.0001, $\eta=$ 0.001, $c=$ 1	0.9467	0.9645	0.8292	36.69	37.39
	HOG $s=$ 6, $o=$ 6	$\delta=$ 0.0001, $\eta=$ 0.01, $c=$ 0.001	0.9489	0.9786	0.7528	38.28	53.96
	HOG $s=$ 6, $o=$ 6	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 0.001	0.9488	0.9786	0.7517	39.83	60.55
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 0.001, $c=$ 1	0.9585	0.9727	0.8641	28.68	25.33
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 0.01, $c=$ 1	0.9585	0.9764	0.8642	23.22	38.42
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 0.1, $c=$ 1	0.9585	0.9730	0.8640	29.87	28.00
	LBP $s=$ 14, $r=$ 2	$\delta=$ 0.0001, $\eta=$ 1, $c=$ 1	0.9585	0.9727	0.8640	27.92	22.83
KLDA	BIF $s=$ 7–37, $\gamma=$ 0.3	$\delta=$ $-$ 1, $c=$ 10	0.9415	0.9539	0.8594	24.89	34.50
	BIF $s=$ 7–37, $\gamma=$ 0.6	$\delta=$ $-$ 1, $c=$ 100	0.9426	0.9558	0.8858	24.74	35.46
	BIF $s=$ 15–29, $\gamma=$ 0.2	$\delta=$ $-$ 1, $c=$ 10	0.9131	0.9374	0.7532	22.80	26.78
	BIF $s=$ 15–29, $\gamma=$ 0.8	$\delta=$ $-$ 1, $c=$ 100	0.9205	0.9421	0.7783	22.83	33.88
	HOG $s=$ 4, $o=$ 4	$\delta=$ 1, $c=$ 1	0.9369	0.9517	0.8392	36.52	81.71
	HOG $s=$ 4, $o=$ 6	$\delta=$ 1, $c=$ 1	0.9398	0.9545	0.8425	52.48	148.24
	HOG $s=$ 12, $o=$ 8	$\delta=$ $-$ 1, $c=$ 100	0.9175	0.9421	0.7542	21.18	21.57
	LBP $s=$ 10, $r=$ 1	$\delta=$ $-$ 0.1, $c=$ 1	0.9418	0.9578	0.8428	24.58	37.17
	LBP $s=$ 10, $r=$ 1	$\delta=$ 1, $c=$ 1	0.9417	0.9558	0.8486	24.70	36.45
	LBP $s=$ 14, $r=$ 1	$\delta=$ 0.1, $c=$ 10	0.9392	0.9543	0.8397	20.77	31.12

(1) Acc represents mean accuracy. (2) TPR represents mean true positive rate (recall/sensitivity): the proportion of male faces correctly classified. (3) TNR represents mean true negative rate (specificity): the proportion of female faces correctly classified. (4) Mem represents mean memory in gigabytes. (5) Time represents mean runtime in hours for training and testing.

5.5 Gender classification

For the classification part of the modeling, linear SVM is adopted. Many face analysis studies have involved SVM, as summarized in [10]. Briefly, SVM identifies a separating hyperplane with maximal margin between the classes. Several popular kernels for SVM include linear, polynomial, and RBF [61]. We select the linear kernel, because directions of variability in the data are expected to be linear after the nonlinear transformations of KPCA, SKPCA, or KLDA. Indeed, Schölkopf et al. observed this to be true for KPCA in their landmark study [57]. The linear kernel for SVM also reduces computational cost, compared to nonlinear kernels.

With the parameters in Table 3 that are selected from tuning on subsets, we implement dimension reduction and classification on the full Morph-II dataset, following the subsetting scheme discussed in Section 5.2. The challenges of the large size of Morph-II, the high dimensionality of the features, and the computational complexity of these dimension reduction methods necessitate the use of high-performance computing (HPC). For example, the kernel matrix for each dimension reduction method is 55134 $\times$ 55134, requiring approximately 23 gigabytes of storage. Thus, we implement the process on the HiPerGator 2.0 supercomputing cluster at the University of Florida. The code is written in R. The R package rARPACK is used to optimize the solving of eigenvalue problems [50], and the e1071 package is utilized for training and testing the SVM model [13].
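
The storage figure quoted for the kernel matrix can be reproduced with a short calculation. The Python sketch below assumes a dense matrix with 8-byte double-precision entries (the default numeric type in R); the function name is chosen for illustration only.

```python
def kernel_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to hold a dense n x n matrix of floating-point entries."""
    return n * n * bytes_per_entry


n_images = 55134  # kernel matrix dimension quoted in the text
total = kernel_matrix_bytes(n_images)
print(f"{total / 1e9:.1f} GB (decimal units)")
print(f"{total / 2**30:.1f} GiB (binary units)")
```

Under this assumption the matrix alone occupies roughly 23 gigabytes in binary units, before any workspace needed by the eigensolver, which is consistent with the need for HPC resources.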

6. Experiment results

The kernel-based DR methods KPCA, SKPCA, and KLDA are applied to three facial feature extraction methods: BIF, HOG, and LBP. The DR methods transform the feature data, then reduce the dimension. In all cases, a dimension of 100 is retained, substantially lower than the dimension of the original feature space. The dimensionality of 100 is selected as a trade-off between computation time and classification accuracy based on our preliminary studies. The transformed and dimension-reduced data serve as input for the linear SVM, which classifies each image subject as male or female. Additionally, these predicted gender classes are mapped to probabilities through a sigmoid function, following [48]. This process is applied to each alternation of the evaluation protocol: 1) train on $S_{1}$ , test on $S_{2}\cup R$ and 2) train on $S_{2}$ , test on $S_{1}\cup R$ . The classification results are averaged over these two testing sets. The mean classification accuracy over the testing images is chosen as the evaluation criterion for our methods on Morph-II, as it is the usual performance metric for gender classification [21], especially in similar studies [24, 25, 78, 71].

These mean classification results from Morph-II are shown in Table 4. In addition to the accuracy, the true positive rate (also known as sensitivity or recall) and true negative rate (also called specificity) are given. For this study, we define the true positive rate (TPR) as the proportion of male faces correctly classified and the true negative rate (TNR) as the proportion of female faces correctly classified. The memory and runtime are also listed in Table 4. The runtime is the total time for training and testing on HPC, averaged over time1 (train on $S_{1}$, test on $S_{2}\cup R$) and time2 (train on $S_{2}$, test on $S_{1}\cup R$). As mentioned in Section 5.4, there is a small overlap between the tuning and testing sets that could contribute to over-fitting. We have assessed the potential impact of over-fitting on our reported accuracy rates and found it to be very small: it is estimated to be (at most) between 0.09% and 0.2% and to monotonically decrease as reported accuracy rates increase.
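
The metric definitions above, with male as the positive class, can be sketched in Python as follows. The function names, the label encoding ("M"/"F"), and the toy predictions are assumptions made for illustration; they are not taken from the authors' R code.

```python
def gender_metrics(y_true, y_pred, positive="M"):
    """Accuracy, TPR and TNR with `positive` (male here) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == positive)
    n_neg = len(pairs) - n_pos
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {"acc": (tp + tn) / len(pairs), "tpr": tp / n_pos, "tnr": tn / n_neg}


def mean_over_alternations(m1, m2):
    """Average each metric over the two train/test alternations."""
    return {key: (m1[key] + m2[key]) / 2 for key in m1}


# Toy labels with a 3:1 male-to-female ratio (hypothetical predictions).
truth = ["M"] * 6 + ["F"] * 2
run1 = gender_metrics(truth, ["M", "M", "M", "M", "M", "F", "F", "M"])
run2 = gender_metrics(truth, ["M", "M", "M", "M", "M", "M", "F", "F"])
mean = mean_over_alternations(run1, run2)
print(run1, run2, mean)
```

Accuracy equals the prevalence-weighted average of TPR and TNR, so when male faces dominate the testing set a high accuracy can coexist with a noticeably lower TNR, as several rows of Table 4 show.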

Figure 5.

Receiver operating characteristic (ROC) curve and area under the curve (AUC) are compared by method for gender classification on Morph-II. Each color corresponds to a DR method paired with feature type. For each probability threshold, the true and false positive rates are reported as the averages from the testing sets of the alternating evaluation protocol.

The classification performance is further visualized in Fig. 5 through receiver operating characteristic (ROC) curves for the nine combinations of DR method and feature extraction type. For each combination, its displayed curve corresponds to the “best” results from Table 4 (the combination of parameters reaching maximum mean classification accuracy or maximum mean true positive rate in the event of ties). For each alternation of the evaluation protocol, the true and false positive rates in testing are calculated for each probability threshold. To construct the ROC curves, each of the resulting rates for each threshold is averaged over the testing sets.
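
The threshold-wise averaging used to build the ROC curves can be sketched as below. This Python example is a simplified illustration under our own assumptions: the function names, the threshold grid, and the toy probabilities are hypothetical, and the trapezoidal AUC is one common choice that the text does not confirm.

```python
def roc_points(y_true, prob_male, thresholds):
    """(FPR, TPR) at each threshold; male ("M") is the positive class."""
    n_pos = sum(1 for t in y_true if t == "M")
    n_neg = len(y_true) - n_pos
    points = []
    for thr in thresholds:
        tp = sum(1 for t, p in zip(y_true, prob_male) if t == "M" and p >= thr)
        fp = sum(1 for t, p in zip(y_true, prob_male) if t != "M" and p >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


def average_roc(points_a, points_b):
    """Average FPR and TPR threshold by threshold over two testing sets."""
    return [((fa + fb) / 2, (ta + tb) / 2)
            for (fa, ta), (fb, tb) in zip(points_a, points_b)]


def auc(points):
    """Trapezoidal area under a set of (FPR, TPR) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Thresholds from 0.0 to 1.1 so the curve spans (1, 1) down to (0, 0).
thresholds = [i / 10 for i in range(12)]
# Hypothetical predicted male probabilities for the two alternations.
roc1 = roc_points(["M", "M", "M", "F"], [0.9, 0.8, 0.4, 0.3], thresholds)
roc2 = roc_points(["M", "M", "F", "F"], [0.95, 0.6, 0.7, 0.2], thresholds)
mean_roc = average_roc(roc1, roc2)
print(round(auc(mean_roc), 4))
```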

Table 4 shows that for the feature BIF, SKPCA and KLDA outperform KPCA. For the feature HOG, SKPCA achieves higher accuracy than both KPCA and KLDA, while the latter two techniques perform very similarly. Last, for the feature LBP, SKPCA produces better classification accuracy than KPCA and KLDA. In summary, our experiment’s results indicate that SKPCA outperforms KLDA consistently, while KLDA outperforms KPCA for all three features BIF, LBP, and HOG. On the other hand, for KPCA, the features HOG and LBP produce approximately the same accuracies, outperforming BIF. For SKPCA, LBP achieves slightly better results than BIF, while LBP and BIF both outperform HOG. Finally, for KLDA, BIF reaches slightly higher accuracy than LBP, while BIF and LBP both exceed HOG.

In most cases, the accuracy (in Table 4) and AUC (in Fig. 5) metrics agree on the best methods. One exception is that SKPCA with the HOG features achieves slightly higher accuracy (94.89%) than KLDA with the BIF features (94.26%), but SKPCA with HOG has lower AUC than KLDA with BIF. The other exception is that KPCA with the HOG features has the lowest AUC of the nine methods, but its accuracy is higher than that of KPCA with the BIF features. In summary, the accuracy and AUC results imply that SKPCA generally performs best for gender classification on Morph-II, while KLDA tends to outperform KPCA. Meanwhile, the LBP and BIF features often yield better classification performance, with less memory usage, than the HOG features.

It is interesting that, overall, LBP achieves even slightly better performance than BIF for the dimension reduction method SKPCA on the task of gender classification, since BIF is popular in demographic analysis such as age estimation, gender classification, and race classification [21, 23, 22, 24, 25]. Another interesting fact is displayed by the results of the true positive and negative rates in Table 4: males have a higher probability of correct identification than females, with the biggest margin exceeding 20%. Our finding is consistent with [26]: females are more challenging to correctly classify than males, both for automatic approaches and human perception. Similarly, for race classification on Morph-II, Guo and Mu found in [23] that training a model on female faces (and testing on male faces) contributed to significantly more errors on average than training on male faces (and testing on female faces), even when controlling for differences in the training sample sizes. Our results also indicate that, overall, HOG and LBP outperform BIF for males, while BIF works consistently better than LBP and HOG for females.

Next, in Table 5 we compare our results to studies using similar methods on Morph-II, as well as recent state-of-the-art works with deep learning on MORPH-II. With the exception of [26], all studies' results in the table are mean testing classification accuracies from an alternating evaluation protocol based on Guo et al. [22]. Hence, our results can be directly compared to these studies. With LBP features, SKPCA, and a linear support vector machine (SVM), our gender classification accuracy approaches 96%, competitive with benchmark results. Interestingly, reported accuracy rates from human observers of gender range from 96% [9] to 96.9% [26]. The similarity in recognition rates between our methods and human observers further supports the success of our approach.

Table 5

Comparison results for gender classification on MORPH-II

Method	Accuracy	Reference	Year
BIF $+$ OLPP	98%	[24]	2011
BIF $+$ PLS	97.34%	[24]	2011
BIF $+$ KPLS	98.2%	[24]	2011
BIF $+$ CCA	95.2%	[25]	2014
BIF $+$ KCCA	98.4%	[25]	2014
BIF $+$ rCCA	97.6%	[25]	2014
Multi-scale CNN	97.9%	[78]	2014
Ranking CNN	97.9%	[71]	2015
BIF $+$ Hierarchical-SVM	97.6%	[26]	2015
Human Estimators	96.9%	[26]	2015
LBP $+$ SKPCA $+$ L-SVM	95.85%	This work	2019

7. Kernel-based dimension reduction optimization and classification on FG-NET

For further comparison between KPCA, SKPCA, and KLDA, we apply a modification of our approach from Section 5 to a smaller face dataset, the face and gesture recognition network (FG-NET). FG-NET is a popular, publicly available database used for age estimation, gender classification, face recognition, and other demographic analysis tasks [46]. It contains 1002 images from 82 subjects: 47 males and 35 females with ages varying from 0 to 69 years [46].

For each image, 109 features are extracted using the Active Appearance Model (AAM), a commonly adopted appearance-based approach that models the shape and texture of the face [14, 11]. As in Section 5.4, the radial kernel defined in Eq. (24) is chosen for each of the DR methods KPCA, SKPCA, and KLDA. Additionally, the modified link function from Eq. (25) is applied in the SKPCA algorithm. Thus, the tuning parameter $\delta$ in the radial kernel and $\eta$ in the modified link function must be selected. As in our experiments on Morph-II, linear SVM is chosen as the classifier for FG-NET. On Morph-II, values of the cost parameter $c$ ranging from $10^{-8}$ to $10^{8}$ were tested. On FG-NET, we have observed convergence issues in the SVM algorithm for values of $c$ exceeding 10, so only the values $10^{-8}$ , $10^{-7}$ , $\ldots$ , $10^{-1}$ , 1, 10 are tested. The considered tuning parameters are summarized in Table 6.

Table 6
Parameter summary for FG-NET

Dimension reduction	KPCA	$\delta=$ 3.2, $3.2\overline{6}$, $3.\overline{3}$, 3.4, $3.4\overline{6}$, $3.5\overline{3}$, 3.6, $3.\overline{6}$, $3.7\overline{3}$, 3.8
	SKPCA	$\delta=$ 0.0098
		$\eta=$ 0.001, 0.01, 0.1, 1
	KLDA	$\delta=$ 3, $3.\overline{5}$, $4.\overline{1}$, $4.\overline{6}$, $5.\overline{2}$, $5.\overline{7}$, $6.\overline{3}$, $6.\overline{8}$, $7.\overline{4}$, 8
Classifier	Linear SVM	$c=10^{-8},\ldots,10^{-1},1,10$
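
The repeating decimals in Table 6 coincide numerically with ten equally spaced values on the intervals [3.2, 3.8] for KPCA and [3, 8] for KLDA. The Python sketch below reproduces them; that the grids were generated this way is our inference from the values, and the helper `linspace` is defined here only for illustration.

```python
def linspace(start, stop, num):
    """Return `num` equally spaced values from `start` to `stop`, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


kpca_delta = linspace(3.2, 3.8, 10)
klda_delta = linspace(3.0, 8.0, 10)
print([round(v, 4) for v in kpca_delta])
print([round(v, 4) for v in klda_delta])
```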

For cross-validation, we use leave-one-person-out (LOPO), the most well-accepted scheme for FG-NET [46]. LOPO is a variation of $k$-fold cross-validation that produces independent training and testing folds in longitudinal datasets. The number of folds $k$ is set equal to the number of subjects in the dataset, so $k=82$ here. For $i=1,2,\ldots,82$, testing fold $i$ contains only images of person $i$, while training fold $i$ contains all remaining images. As on Morph-II, we choose the mean classification accuracy over the testing folds as the evaluation criterion.
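
The LOPO fold construction can be sketched in a few lines of Python. The generator name `lopo_folds` and the toy subject identifiers are assumptions for illustration; on FG-NET the list would hold one subject identifier per image, giving 82 folds.

```python
def lopo_folds(subject_ids):
    """Yield (subject, train_idx, test_idx), one fold per distinct subject."""
    for subject in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train_idx, test_idx


# Toy example: 7 images from 3 hypothetical subjects.
subjects = ["p01", "p01", "p02", "p03", "p03", "p03", "p02"]
folds = list(lopo_folds(subjects))
for subject, train_idx, test_idx in folds:
    print(subject, "train:", train_idx, "test:", test_idx)
```

Because every image of the held-out person is removed from training, no subject appears on both sides of a fold, which is the independence property that ordinary $k$-fold splitting would not guarantee on longitudinal data.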

Table 7

Gender classification results on FG-NET

Method	Parameters	Acc ${}^{(1)}$	TPR ${}^{(2)}$	TNR ${}^{(3)}$
KPCA	$\delta=$ $3.2\overline{6}$, $c=$ 10	0.7025	0.7325	0.6621
	$\delta=$ $3.\overline{3}$, $c=$ 10	0.6932	0.7233	0.6528
	$\delta=$ 3.4, $c=$ 10	0.6801	0.6651	0.7001
SKPCA	$\delta=$ 0.0098, $\eta=$ 0.1, $c=$ 10	0.7154	0.7542	0.6633
	$\delta=$ 0.0098, $\eta=$ 1, $c=$ 0.1	0.6933	0.7413	0.6289
	$\delta=$ 0.0098, $\eta=$ 0.01, $c=$ 0.1	0.6893	0.7701	0.5809
KLDA	$\delta=$ 3, $c=$ 0.01	0.7225	0.7593	0.6730
	$\delta=$ $5.\overline{7}$, $c=$ 1	0.7176	0.7810	0.6324
	$\delta=$ 8, $c=$ 0.1	0.7131	0.7431	0.6727

Figure 6.

Receiver operating characteristic (ROC) curve and area under the curve (AUC) are compared by method for gender classification on FG-NET. Each color corresponds to a DR method.

For each fold, we transform and reduce the dimension of the features through each DR method. In all cases, a dimension of 100 is retained to facilitate comparison with the results on Morph-II. The transformed, dimension-reduced features are then used to predict the gender of the testing fold's images through a linear SVM. The predicted classes from SVM are also mapped to probabilities following [48], as in Section 6. The gender classification accuracy is calculated for the testing fold. Finally, all such testing classification accuracies are averaged to compute the mean classification accuracy from testing; the testing probabilities are used to form ROC curves.

The best gender classification results on FG-NET are presented in Table 7. The maximum classification accuracy of 72.25% is achieved by KLDA. For other choices of parameters, KLDA reaches above 71% accuracy, which is close to the maximum accuracy attained by SKPCA (71.54%). Meanwhile, the peak accuracy reached by KPCA is 70.25%. In general, KLDA is observed to outperform SKPCA and KPCA, while SKPCA tends to surpass KPCA. In most cases, the probability of correctly classifying males (sensitivity/true positive rate) is higher than the probability of correctly classifying females (specificity/true negative rate). For each DR method, an ROC curve (corresponding to the results from Table 7 with maximal mean classification accuracy) is displayed in Fig. 6. The area under the curve (AUC) is highest for KLDA, followed by KPCA then SKPCA.

Overall, the gender classification results on Morph-II are stronger than on FG-NET. Lower accuracy on FG-NET could be caused by the greater number of minors (aged 0–18), who have been more difficult to classify than adults in some studies [65, 66]. Additionally, there are substantially fewer faces for training in FG-NET than in Morph-II (under 1000 versus 10280 images). Another contributor could be the choice of features and their dimensions; the AAM features have dimension 109 on FG-NET, while the HOG, LBP, and BIF features have dimensions ranging from 500 to thousands on Morph-II. SKPCA reaches peak performance on Morph-II, while KLDA attains optimal results on FG-NET. However, the results on Morph-II and FG-NET are similar in that the supervised methods KLDA and SKPCA outperform the unsupervised method KPCA for gender classification. Further, both datasets provide evidence that female faces are more challenging to classify than male faces.

8. Computational framework for practical systems

To tackle the challenges of high dimensionality and intensive computation for large-scale databases (like Morph-II, as shown in the Time column of Table 4) in real-world applications, we propose a computational framework to substantially decrease runtime.

Our approach involves parallel computing, the bootstrap resampling method, and ensemble learning. Let M1 denote the main training set and M2 the testing set. If M1 is very large, we can save time by drawing bootstrapped samples from M1. Let $S_{i}$ denote the $i$th bootstrapped sample from M1. We send $S_{i}$ to a core (or processor), Core $i$, train the model on $S_{i}$, and test on the full testing set M2, obtaining a set of gender predictions corresponding to Core $i$ and $S_{i}$. This process is repeated for all bootstrapped samples and their corresponding cores. The final predictions are obtained by applying a majority rule to the predictions from all cores and samples. Hence, the results from this scheme approximate the results from the full Morph-II. This framework is summarized in Fig. 7.

To explore its effectiveness, this framework is applied to Morph-II with a selection of BIF, LBP, and HOG features as a preliminary study. This experiment is implemented on the HiPerGator 2.0 supercomputer at the University of Florida with five cores per combination of feature and dimension reduction method. Following the subsetting scheme discussed in Section 5.2, for simplicity, we consider only the case of bootstrapping image samples from $S_{1}$ for training, while each image from $S_{2}\cup R$ is used for testing.

Table 8
Classification results based on bootstrapping

Method	Feature	Accuracy	Memory (GB)	Time (min)
KPCA	BIF $s=$ 7–37, $\gamma=$ 0.4	0.9330	27.59	90
	HOG $s=$ 12, $o=$ 8	0.9178	29.77	101
	LBP $s=$ 10, $r=$ 1	0.8927	25.85	37
SKPCA	BIF $s=$ 7–37, $\gamma=$ 0.4	0.9417	53.28	89
	HOG $s=$ 12, $o=$ 8	0.9056	51.43	74
	LBP $s=$ 10, $r=$ 1	0.9274	20.33	24
KLDA	BIF $s=$ 7–37, $\gamma=$ 0.4	0.9416	30.99	100
	HOG $s=$ 12, $o=$ 8	0.9133	25.42	102
	LBP $s=$ 10, $r=$ 1	0.9118	17.05	26

Figure 7.

Flowchart representing the parallel computational framework for practical systems proposed for Morph-II and other large datasets.

We evaluate this framework by comparing the approximated results in Table 8 to the results from Table 4. For each combination of feature and dimension reduction method, each of the five cores independently trains on a bootstrapped sample of 1000 images from $S_{1}$ and tests on $S_{2}\cup R$. Then the gender predictions over all five cores are combined with a simple majority rule; e.g., if an image is predicted male by three cores and female by two cores, the final gender prediction is male. The times in Table 8 are the total runtimes for this process, which include training and testing on HPC. Therefore, the times and memory can be compared between Tables 4 and 8. A distinction is that in Table 4, results are averaged over the alternating scheme, while in Table 8, the results are only from the case in which $S_{1}$ is used for training and $S_{2}\cup R$ for testing.
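
The bootstrap-and-vote scheme can be sketched in Python as follows. This is a simplified illustration under several assumptions: the function names are hypothetical, the loop runs sequentially where the paper dispatches each sample to its own core, and `fit_midpoint` is a toy one-dimensional classifier standing in for the actual pipeline of dimension reduction followed by linear SVM.

```python
import random
from collections import Counter


def bootstrap_sample(data, size, rng):
    """Draw `size` items from `data` with replacement."""
    return [rng.choice(data) for _ in range(size)]


def majority_vote(predictions_per_core):
    """Combine per-core label lists into one label per test image."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_core)]


def bootstrap_ensemble(train, test_x, fit, n_cores=5, sample_size=1000, seed=0):
    """Train one model per bootstrapped sample and vote on the test images."""
    rng = random.Random(seed)
    per_core = []
    for _ in range(n_cores):  # each iteration could run on its own core
        model = fit(bootstrap_sample(train, sample_size, rng))
        per_core.append([model(x) for x in test_x])
    return majority_vote(per_core)


def fit_midpoint(sample):
    """Toy 1-D classifier (hypothetical stand-in for DR + linear SVM)."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0
    cut = (mean([x for x, y in sample if y == "M"])
           + mean([x for x, y in sample if y == "F"])) / 2
    return lambda x: "M" if x > cut else "F"


# Hypothetical 1-D training data: males on the right, females on the left.
train = ([(1.0 + 0.1 * i, "M") for i in range(10)]
         + [(-1.0 - 0.1 * i, "F") for i in range(10)])
final = bootstrap_ensemble(train, [1.5, -1.5, 1.2, -1.1], fit_midpoint,
                           n_cores=5, sample_size=12)
print(final)
```

Using an odd number of cores with two classes avoids tied votes, which may be one reason five cores are a convenient choice.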

Table 8 shows that, in many cases, the accuracy rates from the approximations are similar to those from the main approach in Table 4. This is encouraging, especially considering that the bootstrapping approach uses no more than 5000 images in total for training, while the main approach uses all 10280 images for training. This finding suggests that our methods may perform reasonably well on Morph-II with smaller training sets. The most substantial difference between the bootstrapped approach and the main approach is in the runtime. For all combinations of features and dimension reduction methods, the bootstrapping approach decreases the runtime to under two hours. Meanwhile, the main approach in Table 4 yields runtimes exceeding 20 hours. Hence, our preliminary results indicate that the parallel approximation approach can attain accuracy rates similar to those of the main approach while substantially saving time. Such a result is promising for practical gender classification systems, where gender predictions must be made in real time.

9. Conclusion

We have performed a comparative study of the nonlinear dimension reduction methods KPCA, SKPCA, and KLDA. These kernel-based methods are first applied to three simulated datasets for visualization and comparison. SKPCA and KLDA outperform KPCA, reinforcing the need for supervised approaches in classification tasks. The radial kernel performed well, encouraging its use for face analysis.

Next, we have proposed and evaluated a new machine learning process for Morph-II. First, we use a novel subsetting scheme that reduces class imbalances while establishing independence between training and testing sets. Then we preprocess Morph-II photographs and extract three appearance-based features: HOG, LBP, and BIF. We transform and reduce the dimension of these features through KPCA, SKPCA, and KLDA. Linear SVM classifies the gender of Morph-II subjects, reaching accuracy rates exceeding 95%. With promising preliminary results on Morph-II, a practical computational framework is offered that reduces runtime through parallelization and approximation.

The performance of the dimension reduction methods is further compared through an application to the FG-NET dataset. Images are represented through the appearance-based AAM features; transformed and reduced in dimension through KPCA, SKPCA, and KLDA; and classified as containing a male or female subject through linear SVM. While SKPCA performed best on Morph-II, KLDA reached top performance on FG-NET with 72% leave-one-person-out (LOPO) accuracy.

Further directions of research involve automatic tuning parameter selection, reduction of computational cost, and application to other face analysis tasks. Our approach could yield improved results with better choices of parameters, but it is impossible to anticipate and try all combinations. Automatic parameter selection for kernels could help identify a good set of parameters more easily. Perhaps the most important future direction of research on Morph-II is to reduce computational cost. For many practical demographic analysis systems, predictions must be made in real-time. For our gender classification methods, our parallel approximation approach substantially reduced runtime while attaining similar accuracy rates to the main approach. Such computational strategies should be further investigated to help bring gender classification and other face analysis tasks to practical implementation. Finally, our machine learning pipeline for Morph-II could be generalized to race classification or even age estimation.

Acknowledgments

This material is based in part upon work supported by the National Science Foundation under Grant Number DMS-1659288. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank the reviewers for the helpful comments that significantly improved the presentation of the paper.
