Abstract
Brain Computer Interface (BCI) enables us to record and process the information generated by the brain and process them. Due to high variability of the Electroencephalogram (EEG) data, multiple trails are recorded for a particular task. The present work aims to improve the accuracy for motor imagery task classification by selecting the most prominent trail from the multiple trails recorded during motor imagery. In this paper, we propose a novel weight optimization algorithm for common spatial filtering (CSP) using evolutionary algorithms (i.e. cuckoo search algorithm (CSA), firefly algorithm (FA) and gravitational search algorithm (GSA)) to select the most prominent trial from the multiple trails recorded for feature extraction. The features extracted from the selected trials were thus used for motor imagery task classification. The performance was evaluated on the extracted features from the selected trials using two classifiers namely linear discriminant analysis (LDA) and support vector machines (SVM). It is observed that FA with band power as a feature gives the best performance in comparison to the earlier reported methods i.e. average, error based and alternating direction method of multipliers (ADMM).
Keywords
Introduction
Brain computer interface (BCI) is one of the recently developed paradigm attempting to merge the fields of machine learning and neuroscience. BCI acts as an interface to facilitate communication between the brain and the external world. A non-invasive method of recording the brain activity is through the EEG (Electroencephalogram) but the recordings have high variability, susceptibility to noise, low spatial resolution. Assessing the quality of these trials is a challenge, which is overcome by weighting the trails according to their quality. For example real and imagery changes of hand and feet movement invoke different signals in different regions of the brain. So it is important to capture the changes of EEG signal according to the different motor tasks such as hand movement, leg movement, tongue movement, arm movement etc. which will lead to better classification [1]. EEG signals are non-stationary and does not present any discriminative information. CSP assigns weights to the channels in a manner that it maximizes the variances and produces the maximum distinction between any two classes of EEG data. Physically, the goal of CSP is to reduce the volume conduction effect and localize the EEG to specific channels. The basic framework of CSP is to create a covariance matrix which generates the maximum variance. The generation of the covariance/weight matrix can be viewed as an optimization problem, diagonalization and Eigen value decomposition methods have been used for the purpose, but they fail to produce discriminative information for multi-class scenario. For multiple trials, such matrices for each trails are averaged to from an average covariance matrix to get a generalized spatial filter, which would be able to give better classification results. However, poor quality trails or the trails contaminated with artifacts leads to a poor estimation of the covariance. The problem of finding optimal weights for different trials is highly non-linear. Hence the present work aims only to generate an optimal weight matrix (CSP filter) for spatially filtering motor-imagery data for multi trail scenario using meta-heuristic algorithms (cuckoo search algorithm (CS), firefly algorithm (FA) and gravitational search algorithm (GSA)). The weight matrix thus generated was used to extract features followed by classification of the EEG data using linear discriminant analysis (LDA) and support vector machine (SVM) and finally comparing them with the existing methods.
Literature survey
The proposed work focuses on the development of an optimally weighted common spatial filter. Hence the literature survey presents the various types of CSP’s available. The common spatial pattern (CSP) technique is an unsupervised spatial filtering approach which is well known in the field of neuro-science. It filters out the unwanted channels and assign weights to only those channels that are important. The CSP Algorithm tries to maximize the covariance between the channels and filters out rest of the channels and gives them in order [9]. CSP does not consider the temporal filtering assuming it to be performed separately. To rectify this further work was carried out where spatial and temporal filtering were performed together to achieve more optimality. In common spatio-spectral patterns (CSSPs) [9] algorithm an augmented data matrix is constructed for each trial that contains two copies of the signal. In common sparse spectral spatial pattern (CSSSPs) [10] multiple time copies of the signal are considered for a channel and a single temporal filter is constructed for all the channels, thus giving better flexibility over the CSSP but a closed form solution is not available. Discriminative Filter Bank Common Spatial Pattern (DFBCSP) [11] algorithm maximizes the variance between the classes by designing multiple spatio-temporal filter pairs. The algorithm cannot optimize higher-dimensional spatio-temporal filters. Several algorithms were designed to optimize the spatial and temporal filter at different stages. These algorithms can be divided into two categories; one is iterative and the other is sequential [11]. It is also important to point out that FBCSP filters perform distinct frequency filtering on the channels but CSP filters different frequency sub bands. In case of iterative algorithms SPEC-CSP and ISSPL [12] the frequency and spatial filters are changed iteratively with fisher ratio for optimizing spectral filters by creating a maximum margin hyper plane. Up till now all algorithms only tried to optimize the weights only in a single trial not considering the case of multi-trial where each trials are different. Assigning weights to the trials according to their quality was then proposed. The first approach was introduced where the weights were assigned to the trial based on the error after joint diagonalization [9]. It is not enough to simply assign weights according to the quality of the trials, removal of the low quality trials to reduce the impact of the low quality trial is important, another method was proposed to assign weights to the trials by joint diagonalization [15]. The ADMM weight optimization technique was used, where it was proposed to assign weights to different trials based on their quality [14]. Riemannian geometry was used for averaging covariance matrices which improved the performance for small dimensional problems, but decreased with the increase in dimensionality [20]. A sparse filter band CSP (SFBCSP) was described by Zhang et al. where CSP features were estimated from multiple signals filtered at a set of overlapping bands. The filter bands that produced the most discriminative features were selected in a supervised way by using sparse regression. However the authors only presented errors in classification accuracies without the actual value of classification accuracy [21]. Alessio et al. used generic learning to reduce the covariance matrices of two classes into an identity matrix and a generic covariance matrix and obtained a good classification accuracy of 60–70% but on different dataset [22]. Shiu et al. proposed an improved DFBCSP method using mutual information for selecting the bands that gave the most optimal results and then features were calculated onthem [23].
Methodology
The steps followed in the proposed approach are given in the Fig. 1 for the motor imagery classification. First the signals are preprocessed, followed by extraction of features from the signals and finally classification on the extracted features.

Methodology for classification using trial selection.
In the preprocessing phase the signal is prepared for processing, where artifacts are removed from the signal, weights are assigned to the trials based on their quality and finally spatially filtered for classifying the signals.
Artifact removing
Artifact can be described as noise which corrupts the EEG signals. They originate from various sources, which include the power-line interference, body and eye movements made by the subject himself or from the blinking of eyes by the subject. There are three main approaches to handle eye artifacts:
Trial selection
The EEG signals are highly variable and have a low spatial resolution in nature, hence to ensure the reliability of the EEG trials for motor imagery task multiple trials are recorded.
The various ways to get the prominent trials are as given below.
Average
Each trial plays equal part in average selection.
Weights are assigned to each trial so that better quality trials contribute more and bad quality trials contribute less.
Where Fast Frobenius AJD (FFDIAG)(uses frobenius norm of matrices to find optimized Eigen vectors). Jacobi AJD (uses angle to find optimized Eigen vectors) [15].
In the proposed work, Fast Frobenius AJD method is used because it is faster than any other AJD method.
Given two classes, d1 and d2, the weights of k
d
trial matrices S
k
(k ∈ d
i
) is designed. The matrix should be diagonalizable if the observed EEG is stationary, but this assumption is not true. Hence, it cannot be diagonalized exactly but an approximate joint diagonality criteria can be considered given as:
Here U is a common factor, E
k
and Σ
k
are the error matrix and the diagonal matrix at the kth trial respectively. The trials, for which the Frobenius norm of E
k
(denoted by ||E
k
||
F
) is large are regarded as low-quality trials, and hence smaller weights are assigned to those trials. Thus, we simply choose
Where
And
Where Sk is the trial matrix. The value of α is taken as 1, 0.1 and 0.01. But it is observed that 0.01 gives better performance than the rest. Brief description of the various meta-heuristic algorithms used in this work is given in the next section.
Common spatial pattern (CSP) is a widespread method used in the domain of signal processing for disassociating a multivariate signal into additive sub-components having the maximum differences in variance between two windows. A CSP filter can be represented as:
Where, W = Rp×ch is the spatial filter matrix and S ∈ Rp×times represents the matrix of filtered signal. The CSP can be expressed as:
Subject to, W
T
(Σ1 + Σ2) W = I, Where
The above equation can be solved by generalized Eigen value decomposition, i.e. by diagonalizing ∑1 & ∑2 and getting the channels that maximizes the covariance between the two signals and satisfies the subjected conditions. Where d1 and d2 represents the classes to be classified. By this method only the desired channels are retained and the unwanted channels are filtered out. CSP is useful only in two class case. For multi-class CSP there are twomethods: One class vs the rest- In this approach the class for which CSP is being calculated in kept on one side and the rest of the classes are kept on the other side and then CSP algorithm is applied and the optimal channels are found out for further steps. Combination of any two classes-In this approach combination of any two classes is taken and spatial filtering is performed. By applying this method the unwanted channels are removed from the signals.
A meta-heuristic algorithm is an iterative search process which incorporates intelligent features for enhancing exploration and exploitation in the search for near global optimal solutions. Meta-heuristic algorithms vary in range from simple local search procedures to complex learning processes.
Examples of meta-heuristic algorithms include GSA, Bat Algorithm, Firefly Algorithm, Ant colony optimization etc. In this paper CSA, FA and GSA are used for optimizing the weights of the trials, using the Equation (5) as the objective function and their comparative performance are investigated.
Gravitational search algorithm
GSA was described by Rashedi et al. in 2009 for solving optimization problems. This is a search strategy which uses the Newton’s law of motion and the law of gravitation. In this algorithm the gravity of each agent with respect to others are compared. Agents are the objects and their fitness’s are taken as Mass. The movement of objects towards each other due to gravitational force results in global movement (exploration). The lighter objects moves fast but heavier objects move slowly. The slow movement of heavier objects results in local search (exploitation). The force between two object is given as:
Where F represents the gravitational force between the objects, G is the universal gravitational constant and M1 and M2 are the masses of the two objects respectively and r represents the distance between them. For details about the algorithm [19] may be referred.
Cuckoo search algorithm mimics the behavior of the cuckoos in obligate brood parasitism, where the cuckoos lay their eggs in a host bird’s nest. From those eggs, the eggs having maximum resemblance to that of the host bird’s eggs have the probability to grow up to a mature cuckoo. Rest of the eggs are detected and destroyed by the host bird. The eggs that survive in a particular area represent the favorability of the nests in that particular area. Based on the probability of discovery by the host bird and the location of the nest, the algorithm optimizes the breeding behavior of the cuckoos and can be used for various optimization problems. For details about the algorithm [16] may be referred.
Firefly algorithm
Firefly is a meta-heuristic algorithm for optimization based on the behavior of firefly insects. Firefly uses their flashing behavior for sending signal to the opposite sex, to attract other fireflies etc. But mathematically all fireflies are unisex, so any firefly can be attracted to any other firefly. Attractiveness of a firefly is enumerated by its brightness, the brighter the firefly, the more attractive is the firefly and the lesser bright firefly eventually fly towards them. But the intensity of brightness decreases with the increase in distance. If there are no brighter fireflies then they fly randomly. For details about the algorithm [17] may be referred.
Feature extraction
We have considered two methods for feature extraction in this work, namely band power and time domain parameter.
Band power
In this method the Butterworth infinite impulse response filter is used. So the signal x[t] contains only the required frequency components. The average power for a window size w is:
This means the power is averaged over window w. The final feature value is equal to
The variance of first k derivative of the signal is called time domain parameter (TDP).
The obtained value is later smoothed using exponential moving avg. window filter.
Where, p i is the input signal and y is the filtered signal. The u value is used as a parameter to calculate the time domain parameter. So the final feature is equal to ln(y[n]).
Classification is the two-step process. In first step the classifier is trained with the help of the data set whose class is already known formally known as training data. And in second step we apply trained classifier on the data set whose class is not known formally called test data. In the proposed work, linear discriminant analysis (LDA) and support vector machine (SVM) have been used for classifying EEG data.
Linear discriminant analysis
The Fisher’s Linear discriminate analysis (LDA) classifies data based on mean and covariance matrices of patterns for individual classes. A d-dimensional vector x is transformed to a scalar z as:
The LDA gives an optimal projection w so that the distribution of z is easy to discriminate. The creation of LDA is given by
Where, m1 and m2 denote averages for Z n ∈ class1 and Z n ∈ class2, respectively. S1 and S2 denote the scatters for Z n ∈ class1 and Z n ∈ class2, respectively.
Support vector machine classifier is a separating hyper-plane. This hyper-plane is chosen such that it maximizes the margin between two classes. So the most important training points are the support vectors. The hyper-plane is defined as
The goal of this hyper-plane is to divide the classes in such a way that it maximizes the margin between the classes. In this work soft margin method is used which tolerates some error to maximize the margin. The soft margin tool introduces a C parameter which is used to control the importance of proper classification of the classifier. If the value of the C parameter is large then the algorithm does not allow miss-classification. SVM can be transformed into the nonlinear method of classification if nonlinear kernel function is used. In this work Gaussian kernel function is used as follow.
The value of the function is in between 0 and 1 depending on the similarity of x and y (1 if the two values are equal). If the features are calculated using the Gaussian kernel, non-linear class boundaries can be learned by the SVM classifier. The σ parameter (standard deviation) controls the width of the kernel function (the larger the σ value, the wider the function).
Data description
Data is taken from BCI Competition IV-2a dataset provided by Institute for Knowledge Discovery, Graz University of Technology Austria. The data describes four different motor imagery tasks imagining left hand movement, right hand movement, both feet movement and tongue movement collected from 9 subjects employing a cue based paradigm. Two sessions were recorded and each session consisted of 6 runs. Again each run consisted of 48 trial (12 trial for each class). Only session 1 is used for processing and training, session 2 data is kept for testingpurpose.
Evaluation method
In BCI Competition IV for result submission a kappa Coefficient was defined such that κ ∈[0, 1]. The value of the kappa coefficient can be calculated using the following equation
The coding and simulation is done in Octave 4.0.0 environment. In addition, the biosig toolbox is used to load the EEG data (.gdf) in octave. The whole experiment was carried out on standard computer configuration with intel core i7 CPU processor and 4 GB RAM.
Results
Implementation starts from preprocessing the EEG data, the implementation contains both the cases, one is average weighted and the other is optimally weighted using the meta-heuristic algorithms as discussed before for trial selection over the trials of motor imagery task. Two feature extraction techniques are used, one is band power and the other is time domain parameter. Two classifiers (LDA, SVM) are applied to further classify the result. The steps are described below.
Pre-processing
The EOG channels are removed. These channels are ignored from further processing as they contain eye activity, only the channels relating to the brain activity are considered. Not only do the signals contain the artifacts relating to the eye, but also other kinds of artifacts like the higher frequency components (More than 30 Hz) are also present. To remove them, the Butterworth band pass filter that only allows 7–30 Hz frequency components (smoothing window size is 1 sec) is used. Rest frequency components are filtered out.
Trimming
As mentioned in the dataset, each trial was recorded in 8 second paradigm. In which initially black screen was shown to the subjects. And then short acoustic warning tone was presented to make them alert. Then after 2 seconds the motor imagery task to be performed was shown, 1 second for the subject to think about the motor imagery task and then to imagine it in his mind. After 6 second again the subject was allowed to rest. It was assumed that the imagination of the motor task began after 3 seconds and ended at 6th second. The signals other than that are trimmed off so that better accuracy in classification is achieved.
Multi-trail selection
Trimmed signals are provided by the previous preprocessing step. In context of the data set there are 72 trials for each class for a particular subject. Because trials are recorded in two sessions for 9 subjects one is for training purpose and the other for testing purpose. So in training session for each subject 6 runs are recorded and in each run 48 trials are recorded (12 trials for four class). So there are 288 trials for 4 classes and for one class there are 72 trials for each subject. The trimmed signal matrix for each trial is of size 1500×23 where 1500 is the total number of frames (recording frequency is 250 Hz and the signal is trimmed for 3 second (3–6 sec) i.e. 250×3 = 750 samples for 2 sessions for a total of 25 channels in which 3 EOG channels are removed leaving only 23 channels. Now the quality of the trial is found by FFDIAG algorithm and then the weight vector is optimized with the help of meta-heuristic algorithms as discussed previously. The population size in meta heuristic algorithm is varied from 100–250 and the accuracy for all the features are given in Table 1 (In Table 1 each result is run 5 to 7 times and then the average is taken to overcome the randomness effect). So from the table it is concluded that after an initial population size of 200 the results converge and does not change too much. So initial population is set to 200 and all meta-heuristic algorithms are allowed to optimize the fitness function for 1000 iterations and the convergence curve is shown in theFig. 2.

Convergence curves for Subject 2.
Mean k value for different initial population with fixed iteration size (200)
From the plot in Fig. 2 and results it is observed that gravitational search algorithm in all classes for each subject does not optimize the weights that efficiently but the firefly algorithm optimizes the function much better and also converges quickly. The performance of the cuckoo algorithm is somewhere in between both the algorithms. It should be noted here that GSA and firefly converge in between 300 to 400 iterations but cuckoo algorithm converges much faster.
The optimized weights of the trials for different subjects for each class are obtained. So after applying these weights on the trials, features are extracted and the classifiers are applied. Obtained classification accuracy in term of κ(kappa) and percentage for both training and testing data are given in Tables 3 and 4 respectively. Besides the best classification accuracies for individual subjects are presented in Table 2 along with the feature and the classifier for which the best results were obtained. From the Table 2 it is observed that Firefly Algorithm for weight selection, Band power feature and SVM as classifier yields the best accuracy. Figure 3 gives the plot for the average kappa value for the subjects. The accuracies for the training data have been refined with the help of 8-Fold cross validation and are presented in Table 3 and for testing data are presented in Table 4. It is observed that our proposed method gives better accuracy in comparison to the other reported methods such as average based, ADMM based and error based for selection of trail for motor imagery taskclassification.

Average Classification Accuracy in kappa with Band power as feature and SVM as classifier for all subjects.
Best Classification Accuracies for the subjects with Firefly Algorithm for weight Optimization, Band Power as Feature and SVM as Classifier
Average classification accuracies for various features using LDA and SVM as classifier in terms of kappa (k) and percentage (%) using 8 fold cross validation for training data
Average classification accuracies for various features using LDA and SVM as classifier in terms of kappa (k) and percentage (%) for testing data
This paper investigates the performance of intelligent search algorithms in selecting the prominent trails to be used for motor imagery task classification and further two types of classification methods have been applied. It has been observed that all three meta-heuristic algorithms have performed better than all other methods reported earlier. However, amongst the three algorithms firefly algorithm gives the best performance in terms of faster convergence, and better quality of solutions. These optimized weight vectors provide the most prominent trials for each subject to be used in feature extraction via CSP, so that the most discriminative features from multiple trials of a subject is extracted, followed by classification by two methods, LDA and SVM. The classification accuracy is tested for different features and different classifiers. From the results it can be concluded that among all weight optimization methods for trail selection, firefly algorithm always gives the best results with band power as feature and SVM as the classifier. Further, this work discussed only about the weight selection methods of the trials and thus this idea can be extended to individual frames in a single trial so that only those frames which contain the most discriminative information can be selected and not the other frames of the trials.
Footnotes
Acknowledgments
The above work was supported by the Ministry of Communications & Information Technology No. 13(13)/2014-CC& BT, Government of India.
