Abstract
Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre-processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box, but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what is LDA, and how LDA works, thus enabling readers of all levels be able to get a better understanding of the LDA and to know how to apply this technique in different applications. The paper first gave the basic definitions and steps of how LDA technique works supported with visual explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. class-dependent and class-independent methods, were explained in details. Then, in a step-by-step approach, two numerical examples are demonstrated to show how the LDA space can be calculated in case of the class-dependent and class-independent methods. Furthermore, two of the most common LDA problems (i.e. Small Sample Size (SSS) and non-linearity problems) were highlighted and illustrated, and state-of-the-art solutions to these problems were investigated and explained. Finally, a number of experiments was conducted with different datasets to (1) investigate the effect of the eigenvectors that used in the LDA space on the robustness of the extracted feature for the classification accuracy, and (2) to show when the SSS problem occurs and how it can be addressed.
Keywords
Introduction
Dimensionality reduction techniques are important in many applications related to machine learning [15], data mining [6,33], Bioinformatics [47], biometric [61] and information retrieval [73]. The main goal of the dimensionality reduction techniques is to reduce the dimensions by removing the redundant and dependent features by transforming the features from a higher dimensional space that may lead to a curse of dimensionality problem, to a space with lower dimensions. There are two major approaches of the dimensionality reduction techniques, namely, unsupervised and supervised approaches. In the unsupervised approach, there is no need for labeling classes of the data. While in the supervised approach, the dimensionality reduction techniques take the class labels into consideration [15,32].
There are many unsupervised dimensionality reduction techniques such as Independent Component Analysis (ICA) [28,31] and Non-negative Matrix Factorization (NMF) [14], but the most famous technique of the unsupervised approach is the Principal Component Analysis (PCA) [4,62,67,71]. This type of data reduction is suitable for many applications such as visualization [2,40], and noise removal [70]. On the other hand, the supervised approach has many techniques such as Mixture Discriminant Analysis (MDA) [25] and Neural Networks (NN) [27], but the most famous technique of this approach is the Linear Discriminant Analysis (LDA) [50]. This category of dimensionality reduction techniques are used in biometrics [12,36], Bioinformatics [77], and chemistry [11].
The LDA technique is developed to transform the features into a lower dimensional space, which maximizes the ratio of the between-class variance to the within-class variance, thereby guaranteeing maximum class separability [43,76]. There are two types of LDA technique to deal with classes: class-dependent and class-independent. In the class-dependent LDA, one separate lower dimensional space is calculated for each class to project its data on it whereas, in the class-independent LDA, each class will be considered as a separate class against the other classes [1,74]. In this type, there is just one lower dimensional space for all classes to project their data on it.

Visualized steps to calculate a lower dimensional subspace of the LDA technique.
Although the LDA technique is considered the most well-used data reduction techniques, it suffers from a number of problems. In the first problem, LDA fails to find the lower dimensional space if the dimensions are much higher than the number of samples in the data matrix. Thus, the within-class matrix becomes singular, which is known as the small sample problem (SSS). There are different approaches that proposed to solve this problem. The first approach is to remove the null space of within-class matrix as reported in [56,79]. The second approach used an intermediate subspace (e.g. PCA) to convert a within-class matrix to a full-rank matrix; thus, it can be inverted [4,35]. The third approach, a well-known solution, is to use the regularization method to solve a singular linear systems [38,57]. In the second problem, the linearity problem, if different classes are non-linearly separable, the LDA cannot discriminate between these classes. One solution to this problem is to use the kernel functions as reported in [50].
The brief tutorials on the two LDA types are reported in [1]. However, the authors did not show the LDA algorithm in details using numerical tutorials, visualized examples, nor giving insight investigation of experimental results. Moreover, in [57], an overview of the SSS for the LDA technique was presented including the theoretical background of the SSS problem. Moreover, different variants of LDA technique were used to solve the SSS problem such as Direct LDA (DLDA) [22,83], regularized LDA (RLDA) [18,37,38,80], PCA+LDA [42], Null LDA (NLDA) [10,82], and kernel DLDA (KDLDA) [36]. In addition, the authors presented different applications that used the LDA-SSS techniques such as face recognition and cancer classification. Furthermore, they conducted different experiments using three well-known face recognition datasets to compare between different variants of the LDA technique. Nonetheless, in [57], there is no detailed explanation of how (with numerical examples) to calculate the within and between class variances to construct the LDA space. In addition, the steps of constructing the LDA space are not supported with well-explained graphs helping for well understanding of the LDA underlying mechanism. In addition, the non-linearity problem was not highlighted.
This paper gives a detailed tutorial about the LDA technique, and it is divided into five sections. Section 2 gives an overview about the definition of the main idea of the LDA and its background. This section begins by explaining how to calculate, with visual explanations, the between-class variance, within-class variance, and how to construct the LDA space. The algorithms of calculating the LDA space and projecting the data onto this space to reduce its dimension are then introduced. Section 3 illustrates numerical examples to show how to calculate the LDA space and how to select the most robust eigenvectors to build the LDA space. While Section 4 explains the most two common problems of the LDA technique and a number of state-of-the-art methods to solve (or approximately solve) these problems. Different applications that used LDA technique are introduced in Section 5. In Section 6, different packages for the LDA and its variants were presented. In Section 7, two experiments are conducted to show (1) the influence of the number of the selected eigenvectors on the robustness and dimension of the LDA space, (2) how the SSS problem occurs and highlights the well-known methods to solve this problem. Finally, concluding remarks will be given in Section 8.
Definition of LDA
The goal of the LDA technique is to project the original data matrix onto a lower dimensional space. To achieve this goal, three steps needed to be performed. The first step is to calculate the separability between different classes (i.e. the distance between the means of different classes), which is called the between-class variance or between-class matrix. The second step is to calculate the distance between the mean and the samples of each class, which is called the within-class variance or within-class matrix. The third step is to construct the lower dimensional space which maximizes the between-class variance and minimizes the within-class variance. This section will explain these three steps in detail, and then the full description of the LDA algorithm will be given. Figures 1 and 2 are used to visualize the steps of the LDA technique.

Projection of the original samples (i.e. data matrix) on the lower dimensional space of LDA (
The between-class variance of the ith class (
To calculate the between-class variance (
The transformation matrix (W) will be explained in Section 2.4.
The term
The total between-class variance is calculated as follows, (
The within-class variance of the ith class (
From Equation (5), the within-class variance for each class can be calculated as follows,
Constructing the lower dimensional space
After calculating the between-class variance (
The eigenvalues are scalar values, while the eigenvectors are non-zero vectors, which satisfies the Equation (8) and provides us with the information about the LDA space. The eigenvectors represent the directions of the new space, and the corresponding eigenvalues represent the scaling factor, length, or the magnitude of the eigenvectors [34,59]. Thus, each eigenvector represents one axis of the LDA space, and the associated eigenvalue represents the robustness of this eigenvector. The robustness of the eigenvector reflects its ability to discriminate between different classes, i.e. increase the between-class variance, and decreases the within-class variance of each class; hence meets the LDA goal. Thus, the eigenvectors with the k highest eigenvalues are used to construct a lower dimensional space (
Figure 2 shows the lower dimensional space of the LDA technique, which is calculated as in Fig. 1 (step (G)). As shown, the dimension of the original data matrix (
Figure 3 shows a comparison between two lower-dimensional sub-spaces. In this figure, the original data which consists of three classes as in our example are plotted. Each class has five samples, and all samples are represented by two features only ( First, the separation distance between different classes when the data are projected on the first eigenvector ( Second, the within-class variance when the data are projected on
From these two notes, we conclude that the first eigenvector meets the goal of the lower-dimensional space of the LDA technique than the second eigenvector; hence, it is selected to construct a lower-dimensional space.

A visualized comparison between the two lower-dimensional sub-spaces which are calculated using three different classes.
The aim of the two methods of the LDA is to calculate the LDA space. In the class-dependent LDA, one separate lower dimensional space is calculated for each class as follows,

Linear Discriminant Analysis (LDA): Class-Independent

Linear Discriminant Analysis (LDA): Class-Dependent
In this section the detailed steps of the algorithms of the two LDA methods are presented. As shown in Algorithms 1 and 2, the first four steps in both algorithms are the same. Table 1 shows the notations which are used in the two algorithms.
Computational complexity of LDA
In this section, the computational complexity for LDA is analyzed. The computational complexity for the first four steps, common in both class-dependent and class-independent methods, are computed as follows. As illustrated in Algorithm 1, in step (2), to calculate the mean of the ith class, there are
In Algorithm 2, the number of operations to calculate the within-class variance for each class
Notation
Notation
In our case, we assumed that there are 40 classes and each class has ten samples. Each sample is represented by 4096 features (
In this section, two numerical examples will be presented. The two numerical examples explain the steps to calculate the LDA space and how the LDA technique is used to discriminate between only two different classes. In the first example, the lower-dimensional space is calculated using the class-independent method, while in the second example, the class-dependent method is used. Moreover, a comparison between the lower dimensional spaces of each method is presented. In all numerical examples, the numbers are rounded up to the nearest hundredths (i.e. only two digits after the decimal point are displayed).
The first four steps of both class-independent and class-dependent methods are common as illustrated in Algorithms 1 and 2. Thus, in this section, we show how these steps are calculated.
Given two different classes,
To calculate the lower dimensional space using LDA, first the mean of each class
The between-class variance of each class (
Similarly,
The total between-class variance is calculated a follows:
To calculate the within-class matrix, first subtract the mean of each class from each sample in that class and this step is called mean-centering data and it is calculated as follows,
In the next two subsections, two different methods are used to calculate the LDA space.
Class-independent method
In this section, the LDA space is calculated using the class-independent method. This method represents the standard method of LDA as in Algorithm 1.
After centring the data, the within-class variance for each class (
The eigenvalues (
From the above results it can be noticed that, the second eigenvector (
Similarly,

Probability density function of the projected data of the first example, (a) the projected data on
Figure 4 illustrates a probability density function (pdf) graph of the projected data (
The data of each class is completely discriminated when it is projected on the second eigenvector (see Fig. 4(b)) than the first one (see Fig. 4(a). In other words, the second eigenvector maximizes the between-class variance more than the first one.
The within-class variance (i.e. the variance between the same class samples) of the two classes are minimized when the data are projected on the second eigenvector. As shown in Fig. 4(b), the within-class variance of the first class is small compared with Fig. 4(a).
In this section, the LDA space is calculated using the class-dependent method. As mentioned in Section 2.5, the class-dependent method aims to calculate a separate transformation matrix (
The within-class variance for each class (
Similarly,
The eigenvalues (
From the results shown (above) it can be seen that, the second eigenvector of the first class (
Figure 5 shows a pdf graph of the projected data (i.e.
First, the projection data of the two classes are efficiently discriminated.
Second, the within-class variance of the projected samples is lower than the within-class variance of the original samples.

Probability density function (pdf) of the projected data using class-dependent method, the first class is projected on
In these two numerical examples, the LDA space is calculated using class-dependent and class-independent methods.
Figure 6 shows a further explanation of the two methods as following:
Class-Independent: As shown from the figure, there are two eigenvectors, The projected data on the second eigenvector ( The second eigenvector decreases the within-class variance much better than the first eigenvector. Figure 6 illustrates that the within-class variance of the first class ( As a result of the above two findings, Class-Dependent: As shown from the figure, there are two eigenvectors, Projecting the original data on the two eigenvectors discriminates between the two classes. As shown in the figure, the distance between the projected means The within-class variance of each class is decreased. For example, the within-class variance of the first class ( As a result of the above two findings, Class-Dependent vs. Class-Independent: The two LDA methods are used to calculate the LDA space, but a class-dependent method calculates separate lower dimensional spaces for each class which has two main limitations: (1) it needs more CPU time and calculations more than class-independent method; (2) it may lead to SSS problem because the number of samples in each class affects the singularity of SSS problem will be explained in Section 4.2.

Illustration of the example of the two different methods of LDA methods. The blue and red lines represent the first and second eigenvectors of the class-dependent approach, respectively, while the solid and dotted black lines represent the second and first eigenvectors of class-independent approach, respectively.
Although LDA is one of the most common data reduction techniques, it suffers from two main problems: the Small Sample Size (SSS) and linearity problems. In the next two subsections, these two problems will be explained, and some of the state-of-the-art solutions are highlighted.
Linearity problem
LDA technique is used to find a linear transformation that discriminates between different classes. However, if the classes are non-linearly separable, LDA can not find a lower dimensional space. In other words, LDA fails to find the LDA space when the discriminatory information are not in the means of classes. Figure 7 shows how the discriminatory information does not exist in the mean, but in the variance of the data. This is because the means of the two classes are equal. The mathematical interpretation for this problem is as follows: if the means of the classes are approximately equal, so the
One of the solutions of this problem is based on the transformation concept, which is known as a kernel methods or functions [3,50]. Figure 7 illustrates how the transformation is used to map the original data into a higher dimensional space; hence, the data will be linearly separable, and the LDA technique can find the lower dimensional space in the new space. Figure 8 graphically and mathematically shows how two non-separable classes in one-dimensional space are transformed into a two-dimensional space (i.e. higher dimensional space); thus, allowing linear separation.
The kernel idea is applied in Support Vector Machine (SVM) [49,66,69] Support Vector Regression (SVR) [58], PCA [51], and LDA [50]. Let ϕ represents a nonlinear mapping to the new feature space

Two examples of two non-linearly separable classes, top panel shows how the two classes are non-separable, while the bottom shows how the transformation solves this problem and the two classes are linearly separable.
Thus, in kernel LDA, all samples are transformed non-linearly into a new space

Example of kernel functions, the samples lie on the top panel (X) which are represented by a line (i.e. one-dimensional space) are non-linearly separable, where the samples lie on the bottom panel (Z) which are generated from mapping the samples of the top space are linearly separable.
Problem definition
Singularity, Small Sample Size (SSS), or under-sampled problem is one of the big problems of LDA technique. This problem results from high-dimensional pattern classification tasks or a low number of training samples available for each class compared with the dimensionality of the sample space [30,38,82,85].
The SSS problem occurs when the
A matrix is singular if it is square, does not have a matrix inverse, the determinant is zeros; hence, not all columns and rows are independent.
The rank of the matrix represents the number of linearly independent rows or columns.
There are many studies that proposed many solutions for this problem; each has its advantages and drawbacks.
A is a full-rank matrix if all columns and rows of the matrix are independent, (i.e.
Four different variants of the LDA technique that are used to solve the SSS problem are introduced as follows:
PCA+LDA technique. In this technique, the original d-dimensional features are first reduced to h-dimensional feature space using PCA, and then the LDA is used to further reduce the features to k-dimensions. The PCA is used in this technique to reduce the dimensions to make the rank of
Direct LDA technique. Direct LDA (DLDA) is one of the well-known techniques that are used to solve the SSS problem. This technique has two main steps [83]. In the first step, the transformation matrix, W, is computed to transform the training data to the range space of
Regularized LDA technique. In the Regularized LDA (RLDA), a small perturbation is add to the
Null LDA technique. The aim of the NLDA technique is to find the orientation matrix W, and this can be achieved using two steps. In the first step, the range space of the
In many applications, due to the high number of features or dimensionality, the LDA technique have been used. Some of the applications of the LDA technique and its variants are described as follows:
Biometrics applications
Biometrics systems have two main steps, namely, feature extraction (including pre-processing steps) and recognition. In the first step, the features are extracted from the collected data, e.g. face images, and in the second step, the unknown samples, e.g. unknown face image, is identified/verified. The LDA technique and its variants have been applied in this application. For example, in [10,20,41,68,75,83], the LDA technique have been applied on face recognition. Moreover, the LDA technique was used in Ear [84], fingerprint [44], gait [5], and speech [24] applications. In addition, the LDA technique was used with animal biometrics as in [19,65].
Agriculture applications
In agriculture applications, an unknown sample can be classified into a pre-defined species using computational models [64]. In this application, different variants of the LDA technique was used to reduce the dimension of the collected features as in [9,21,26,46,63,64].
Medical applications
In medical applications, the data such as the DNA microarray data consists of a large number of features or dimensions. Due to this high dimensionality, the computational models need more time to train their models, which may be infeasible and expensive. Moreover, this high dimensionality reduces the classification performance of the computational model and increases its complexity. This problem can be solved using LDA technique to construct a new set of features from a large number of original features. There are many papers have been used LDA in medical applications [8,16,39,52–55].
Packages
In this section, some of the available packages that are used to compute the space of LDA variants. For example, WEKA6
In this section, two experiments were conducted to illustrate: (1) how the LDA is used for different applications, (2) what is the relation between its parameter (Eigenvectors) and the accuracy of a classification problem, (3) when the SSS problem could appear and a method for solving it.
Experimental setup
This section gives an overview of the databases, the platform, and the machine specification used to conduct our experiments. Different biometric datasets were used in the experiments to show how the LDA using its parameter behaves with different data. These datasets are described as follows:
ORL dataset13
Yale dataset14
2D ear dataset (Carreira-Perpinan,1995)15
In all experiments, k-fold cross-validation tests have used. In k-fold cross-validation, the original samples of the dataset were randomly partitioned into k subsets of (approximately) equal size and the experiment is run k times. For each time, one subset was used as the testing set and the other
The images in all datasets resized to be

Samples of the first individual in: ORL face dataset (top row); Ear dataset (middle row), and Yale face dataset (bottom row).
Dataset description
In all experiments, to show the effect of the LDA with its eigenvector parameter and its SSS problem on the classification accuracy, the Nearest Neighbour classifier was used. This classifier aims to classify the testing image by comparing its position in the LDA space with the positions of training images. Furthermore, class-independent LDA was used in all experiments. Moreover, Matlab Platform (R2013b) and using a PC with the following specifications: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz and 4.00 GB RAM, under Windows 32-bit operating system were used in our experiments.
The aim of this experiment is to investigate the relation between the number of eigenvectors used in the LDA space, and the classification accuracy based on these eigenvectors and the required CPU time for this classification.

Accuracy and CPU time of the LDA techniques using different percentages of eigenvectors, (a) Accuracy (b) CPU time.
As explained earlier that the LDA space consists of k eigenvectors, which are sorted according to their robustness (i.e. their eigenvalues). The robustness of each eigenvector reflects its ability to discriminate between different classes. Thus, in this experiment, it will be checked whether increasing the number of eigenvectors would increase the total robustness of the constructed LDA space; hence, different classes could be well discriminated. Also, it will be tested whether increasing the number of eigenvectors would increase the dimension of the LDA space and the projected data; hence, CPU time increases. To investigate these issue, three datasets listed in Table 2 (i.e.
From Fig. 10 it can be noticed that the accuracy and CPU time are proportional with the number of eigenvectors which are used to construct the LDA space. Thus, the choice of using LDA in a specific application should consider a trade-off between these factors. Moreover, from Fig. 10(a), it can be remarked that when the number of eigenvectors used in computing the LDA space was increased, the classification accuracy was also increased to a specific extent after which the accuracy remains constant. As seen in Fig. 10(a), this extent differs from application to another. For example, the accuracy of the ear dataset remains constant when the percentage of the used eigenvectors is more than 10%. This was expected as the eigenvectors of the LDA space are sorted according to their robustness (see Section 2.6). Similarly, in ORL and Yale datasets the accuracy became approximately constant when the percentage of the used eigenvectors is more than 40%. In terms of CPU time, Fig. 10(b) shows the CPU time using different percentages of the eigenvectors. As shown, the CPU time increased dramatically when the number of eigenvectors increases.
The weight of the eigenvector represents the ratio of its corresponding eigenvalue (
These experiments confirmed that increasing the number of eigenvectors will increase the dimension of the LDA space; hence, CPU time increases. Consequently, the amount of discriminative information and the accuracy increases.

The robustness of the first 40 eigenvectors of the LDA technique using
The aim of this experiment is to show when the LDA is subject to the SSS problem and what are the methods that could be used to solve this problem. In this experiment, PCA-LDA [4] and Direct-LDA [22,83] methods are used to address the SSS problem. A set of experiments was conducted to show how the two methods interact with the SSS problem, including the number of samples in each class, the total number of classes used, and the dimension of each sample.
As explained in Section 4, using LDA directly may lead to the SSS problem when the dimension of the samples are much higher than the total number of samples. As shown in Table 2, the size of each image or sample of the ORL dataset is
To address this problem, PCA-LDA and direct-LDA methods were implemented. In the PCA-LDA method, PCA is first used to reduce the dimension of the original data to make
Table 2 illustrates various scenarios designed to test the effect of different dimensions on the rank of

PCA-LDA

Direct-LDA
Accuracy of the PCA-LDA and direct-LDA methods using the datasets listed in Table 2
The bold values indicate that the corresponding methods obtain best performances.
As summarized in Table 3, the rank of
The range of a matrix A represents the column-space of A (
In this paper, the definition, mathematics, and implementation of LDA were presented and explained. The paper aimed to give low-level details on how the LDA technique can address the reduction problem by extracting discriminative features based on maximizing the ratio between the between-class variance,
