Abstract
Social networks, or social media, are online platforms used by billions of people around the world. These platforms make it easy for people to hold conversations, share information and videos, send instant messages, create virtual worlds and more. The dominant form of interaction on social media is text messaging. Detecting emotions in these text messages is not difficult for humans, who experience emotions themselves, but it is a difficult task for a computer. Various models, such as fuzzy models, vector space models, keystroke dynamics and character n-gram models, have been proposed in the literature for emotion detection, but every model has its limitations and drawbacks. In this study a novel K-RCC (Reduced Computational Complexity) emotion detection model is proposed, based on the K Nearest Neighbor (KNN) algorithm. The K-RCC algorithm reduces the computational complexity and the incorrect classification rate, which are the main drawbacks of the KNN algorithm. The K-d tree algorithm reduces the computational complexity of KNN to some extent, but at the cost of an increased incorrect classification rate. A systematic performance analysis of K-RCC is carried out against four machine learning classification algorithms for the detection of human emotions from tweets collected from the social media site Twitter. The emotions are classified into six classes: disgust, fear, joy, sadness, anger and shame. K-RCC performs better both in reducing the computational complexity and incorrect classification rate and in detecting human emotions.
Introduction
Emotion is a mental state that arises spontaneously rather than through conscious effort and is often accompanied by physiological changes; it is a feeling, such as joy, sorrow, reverence, hate or love. The physical expression of emotion may be outward and evident to others, as in crying, laughing, blushing or a variety of facial expressions. However, emotion is not always reflected in one’s appearance and actions, even though psychic changes are taking place. Joy, grief, fear and anger are examples of emotions. Emotions can be expressed through speech, facial expressions and the textual information available on social sites. Text is currently the most widespread form of communication on social networks.
The emotional characteristics extracted from this textual data are useful in many areas, such as sentiment analysis, text-to-speech generation and better human-computer interaction. From an application point of view, automatic detection of emotion from text, whether expressed by a single word or a group of words, is becoming increasingly important. Classification of emotions allows us to identify the feelings of users for a variety of purposes or towards a specific event. The text entered by a user in a blog, an online chat site or another medium reveals the user’s emotional state.
Bodily changes, thoughts and behavior are among the components that constitute complex cognitive reactions. Various models have been proposed on the basis of the interaction of these components, but no single, universally accepted formulation exists at present. A great deal of interest has emerged from the field of human-computer interaction. The main objective is to design a system that can identify emotional states automatically; modeling emotional states is a very challenging problem whose solution would revolutionize applications in entertainment, education, safety, medicine and other fields. The first steps in modeling any phenomenon are data collection and the learning process. Learning, or cognition, is the process of acquiring and understanding knowledge through thought, experience and the senses. Whenever you see or hear something new, you go through a series of cognitive processes that result in learning. The ability of computers to learn without being explicitly programmed is called machine learning, a branch of artificial intelligence. Machine learning focuses on building computer programs that can teach themselves to grow and change when exposed to new data. An algorithm can model a problem in different ways, depending on its interaction with experience, the environment or the input data. There are only a few main learning styles that an algorithm can adopt, and the choice depends on the type of problem.
A taxonomy of machine learning algorithms is useful for obtaining the best results: it forces you to think about the role of the input data and to select the most appropriate model for your problem.
In recent years the use of social networks has increased markedly among young people. Social networking allows people to stay in touch over long distances and is a fun way to communicate on the internet. It is also a way to socialize with friends and family who may be located in another part of the world. Despite its drawbacks, it is used in a variety of applications, and if used correctly its drawbacks can turn into benefits. Through social media citizens are able to have a say, and the government is able to make innovations that improve the state or even the country. Without it, however, decisions are made by the government alone, and a free country becomes more of a cage. For emotion analysis, social media sites such as Facebook and Twitter are the most popular micro-blogging sites. They generate a wealth of data about what people are thinking or feeling. According to the Washington Post [1], Twitter has more than 200 million active users and generates 400 million tweets every day. Because social networking sites have appeared relatively recently, few research works have been carried out to develop methods for analyzing the emotional dynamics in this wealth of data.
Various tailored methods have been proposed for reducing the computational complexity of the KNN algorithm; they are categorized into acceleration-based methods and instance-selection-based methods. Because social network analysis is a large-scale computing problem, acceleration-based methods may be less effective than instance-selection-based methods for improving the efficiency of the KNN algorithm. Triguero I, Peralta D, Bacardit J, et al. (2015) [2], in “MRPR: A MapReduce solution for prototype reduction in big data classification”, proposed a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification, to speed up the classification process and reduce the storage requirements and sensitivity. Zhai J H, Wang X Z and Pang X H (2016) [3], in “Voting-based instance selection from large data sets with MapReduce and random weight networks”, proposed an algorithm based on a voting mechanism that chooses a subset of a large dataset by removing redundant instances. The purpose of instance selection is to reduce the computational complexity of the classifier with little or no deterioration in performance. Zhai J H, Li T and Wang X Z (2016) [4], in “A cross-selection instance algorithm”, proposed two instance selection algorithms motivated by the cross-validation technique; the proposed algorithms are feasible and effective in selecting instances from big data sets. Enrique Leyva, Antonio Gonzalez and Raul Perez (2013) [5] proposed three new instance selection methods that reduce the number of instances in the training set without affecting classification accuracy. Alvar Arnaiz-Gonz
In this paper we analyze text in order to classify emotions using a novel approach. The classification is based on six emotion classes: disgust, fear, joy, sadness, anger and shame.
Paper Organization: The remainder of this paper is organized as follows. Section 2 introduces the background research and related work in this area. Section 3 describes the motivation for and contribution of the proposed approach. Section 4 presents the proposed algorithm and flow chart. Section 5 describes the system model of the proposed algorithm. Section 6 explains the mathematical model and evaluation of the proposed algorithm. Section 7 describes the experimental setup and result analysis. Section 8 concludes the paper. The last section is the reference section.
Background research
Researchers are fascinated by emotions, and a large volume of research literature is available in the fields of linguistics, social sciences, psychology and communication. Human emotions manifest themselves in many forms: speech utterances, actions, facial expressions, written text and gestures. “Passions of the Soul” is among the earliest works to theorize emotions, published by the French philosopher Ren
Lexicon/Corpus based methods (based on dictionaries of terms with associated emotion orientation)
The Circumplex Theory of Affect (Watson and Tellegen, 1985) [12] identifies two main dimensions, positive and negative affect, each ranging from high to low. Johnson-Laird and Oatley (1989) [13] deduced basic emotions by analyzing 590 English words that describe emotion. Emotive meanings of words are assigned by Osgood’s theory of Semantic Differentiation (Osgood et al., 1957). Depending on the context of the words in the text (Clore et al., 1987), emotions may be classified as explicit (direct affective words) or implicit (indirect affective words). Strapparava and Valitutti (2006) [14] classified words into ‘direct affective words’ and ‘indirect affective words’ categories. Kaveh Bakhtiyari and Hafizah Husain [15] discuss a fuzzy model for multi-level recognition of human emotions by computer systems through keyboard keystrokes, mouse and touch screen interactions. Jianhua Tao et al. [16] built an emotion estimation net (ESiN) that estimates the final emotion by combining emotion functional words (EFWs) and content words. Text from a spontaneous speech corpus was used to obtain relatively good results.
Changqin Quan and Fuji Ren (2010) [17] proposed a manual annotation model for emotion detection. In this work, Chinese emotional expression is analyzed using blogs as the objects and data source. Chung-Hsien Wu, Ze-Jing Chuang and Yu-Chung (2006) [18] automatically derived emotion association rules (EARs), represented by semantic labels (SLs) and attributes (ATTs) for each emotion, from the sentences in an emotional text corpus using the Apriori algorithm.
Rada Mihalcea and Hugo Liu [19] applied ‘linguistic ethnography’ to seek out where happiness lies in our everyday lives, using a corpus of blog posts from the LiveJournal community annotated with happy and sad moods.
Dipankar et al. [20] proposed a sentence-level emotion identification method that uses a Conditional Random Field (CRF) classifier to classify words. It groups emotions into six emotion tags and one neutral tag.
Saima Aman and Stan Szpakowicz (2007) [21] presented an annotation scheme for identifying the emotion category, emotion words and emotion intensity. Ubeeka Jain and Amandeep Sandh (2015) [22], in their review paper on emotion detection from text using machine learning techniques, summarized various works. Obdal and Wang (2014) presented a novel approach for emotion detection in Chinese: a fine-grained, segment-based supervised learning method for detecting emotion in subjective sentences or text, represented by a hidden variable. Ho and Cao (2012) exploited the idea that human mental states are caused by emotional events, so that the occurrence of certain events moves a person from one emotional state to another. Ghazi et al. (2010) studied the six Ekman emotions using hierarchical classification. The hierarchy ran from sentence level to word level for positive and negative sentiment.
Probabilistic and statistical methods
Sophie F Waterloo, Susanne E Baumgartner, Jochen Peter and Patti M Valkenburg (2017) [23] examined the norms of expressing emotion across four social media platforms: Facebook, Twitter, Instagram and WhatsApp. The findings reveal platform differences by gender and age and contribute to a more informed understanding of emotion expression online.
Huiqun Zhang et al. [24] proposed the Socioscope model for analyzing human behavior on social networks. Probabilistic and statistical methods were used to detect changes in human behavior.
Jeremie Clos, Anil Bandhakavi, Nirmalie Wiratunga and Guillaume Cabanac [25] presented a technique for predicting the emotional impact of news on its consumers. In the first step, the emotion distribution of a post is learnt from an emotion lexicon; in the second step, a multi-linear regression model is used to predict the emotion distribution.
Yoad Lewenberg, Yoram Bachrach and Svitlana Volkova (2015) [26] examined the relation between the emotions users express and their perceived areas of interest. A correlation model is used to build a machine learning model that can predict the interests of a user from the emotions expressed in their social network profile.
Zhiwen Yu, Fei Yi, Chao Ma, Zhu Wang, Bin Guo and Liming Chen [27] first defined two dimensions of emotion behavior: emotion orientation and emotional influence. The MERM model is then used to identify emotion roles by introducing five different facts.
Supervised machine learning methods (which need training data)
Nearest neighbor algorithms are traditional and among the most popular methods for classification. The K-nearest-neighbor algorithm classifies an unseen data point according to its k nearest neighbors. The major difficulty in k-nearest-neighbor classification is how to reduce the computational cost and the misclassification rate. Several methods have been used for this purpose, but no method dominates the literature. The accuracy of the K Nearest Neighbor algorithm depends mostly on two parameters: the distance function and the value of k. The distance functions used in the literature for calculating the distance between two points include the Euclidean distance, City Block (Manhattan) distance, Minkowski distance, Chebyshev distance, Canberra distance and Bray-Curtis distance (Sorensen distance). The other parameter, the value of k, controls the volume of the neighborhood and consequently the smoothness of the density estimate. Different approaches were presented in the literature.
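For readers who want the distance functions listed above in concrete form, the following is a minimal sketch using their standard textbook definitions. The function names are illustrative and are not taken from the paper; the Canberra and Bray-Curtis conventions for zero denominators are assumptions.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def manhattan(a, b):
    # City Block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute difference along any single dimension.
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Assumption: terms whose denominator is zero are skipped.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total

def bray_curtis(a, b):
    # Assumption: returns 0.0 when the denominator is zero.
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(x + y) for x, y in zip(a, b))
    return num / den if den else 0.0
```

Any of these can be substituted for the distance function in a KNN classifier; the rest of this paper uses the Euclidean distance.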
Niko Colnerič and Janez Demšar (2018) [28] aimed to improve on the performance of lexicon-based methods and simple bag-of-words classifiers. They investigated the transferability of the final hidden state representations between different classifications of emotions, and whether it is possible to build a unison model for predicting all of them using a shared representation.
Miloud-Aouidate Amal and Baba-Ali Ahmed Riadh (2011) [29] suggested several condensing techniques to reduce the computational complexity of the KNN algorithm, including CNN (condensed nearest neighbor rule), RNN (reduced nearest neighbor rule) and FCNN (fast condensed nearest neighbor rule). These techniques improved the computation time but fail to guarantee the minimality of their resulting sets. To improve the solutions of these algorithms, hybridizing techniques called modern heuristics or meta-heuristics were used. Muhammad Arif, Muhammad Usman Akram and Fayyaz-ul-Afsar Amir Minhas (2010) [30] proposed a Pruned Fuzzy K-nearest Neighbor (PFKNN) classifier to classify six types of beats in the dataset. With a large training set, Fuzzy KNN is impractical because it requires large storage and is time consuming. The Arif-Fayyaz pruning algorithm is especially suitable for FKNN and can maintain good classification accuracy with an appropriate retained ratio of training data. Hamid Parvin, Hosein Alizadeh and Behrouz Minaei-Bidgoli (2008) [31] presented a Modified K-Nearest Neighbor algorithm, evaluated on five different data sets. The main idea was to classify the test samples according to their neighbor tags; the paper addresses the low accuracy rate of the KNN algorithm. Yingquan Wu, Krassimir Ianakiev and Venu Govindaraju (2001) [32] presented two techniques, template condensing and pre-processing, to significantly speed up k-NN classification, which is otherwise very expensive in the traditional KNN implementation. Template condensing attempts to sparsify the dense attractive areas and is efficiently implemented by gradually eliminating the prototypes with high attractive capacity. The redundant prototypes are discarded while the useful ones are kept.
Ruiqin Chang, Zheng Pei and Chao Zhang (2011) [33] proposed a new editing k nearest neighbor rule that edits the reference list, which consists of subsets of the reference set and testing set. The performance of the new algorithm is compared with K-NN, EK-NN and MEK-NN. In comparison, MEKNN finds a more reasonable training set and better classification accuracy, but at the expense of run time compared with EKNN. The advantages of the MEKNN method are reduced loss of information and an improved recognition rate. T. Priyanka and N. Narasimha Swamy [34] proposed a KNN-based document classifier using a K-d tree. Because of the high dimensionality of documents, the efficiency of text classification is low, especially when the corpus contains many noisy or irrelevant term features. To overcome these problems, KNN and the K-d tree are combined and implemented in the CenKNN classifier in two stages. The performance, measured by F1-score on the datasets, showed its effectiveness and reduced computation time. Another modification to the K-d tree search algorithm was proposed by Rina Panigrahy (2008) [35] for improved performance, in which approximate solutions were presented. A c-approximate nearest neighbor is any neighbor within distance at most c times the distance to the nearest neighbor. The traditional K-d tree search algorithm has a very low probability of finding an approximate nearest neighbor; the probability of success drops exponentially in the number of dimensions. Mohammad Rezwanul Huq, Ahmad Ali and Anika Rahman (2017) [36] introduced a new functional KNN algorithm called the sentiment classification algorithm (SCA). The algorithm was implemented on Twitter micro-blogging data, and its performance was compared with that of another machine learning algorithm, the support vector machine (SVM). In this paper, the main focus was on dividing tweets into positive and negative sentiment.
In this work, the KNN-based sentiment classification algorithm (SCA) performs better than SVM, with considerable scope for future work. Güneş Erkan, Ahmed Hassan, Qian Diao and Dragomir R. Radev [37] presented new nearest neighbor methods for text classification. Improvements in classification were shown for the K Nearest Neighbor (KNN) algorithm by replacing the classical cosine similarity with a KL-divergence-based similarity measure. A semi-supervised extension of KNN was formulated that is equivalent to semi-supervised learning with harmonic functions. Competitive results were produced when compared with other state-of-the-art machine learning methods such as Support Vector Machines (SVM) and Transductive SVM. Amit Chauhan, Anil Saroliya and Varun Sharma (2015) [38] presented a modified KNN (K-nearest neighbor) algorithm designed to classify the pre- and post-driving fatigue levels of vehicle drivers using pulse oximetry signals. Two versions of the KNN algorithm are compared for performance analysis. Physiological and psychological parameters are measured to determine the levels of physical and mental fatigue of drivers, with the aim of developing a solution for preventing accidents.
An emotion classification of web blog corpora was proposed by Yang et al. [39]; they used Conditional Random Field (CRF) and Support Vector Machine (SVM) machine learning techniques.
Bincy Thomas et al. [40] introduced an approach to emotion classification from sentences. Supervised machine learning with bag of words technique was introduced. ISEAR dataset was considered and tested for various classifiers. Weighted log likelihood score (WLLS) is used for the selection of different feature sets.
Several works have been reported in the literature on the study of emotion expression using different approaches, such as text analysis with mathematical algorithms, emotion recognition using supervised machine learning algorithms, keystroke analysis and mouse movement [41], identifying emotions by keystroke dynamics and test pattern analysis [42], and ECG and EEG pattern analysis [43–45]. Creative use of words is necessary for communication on social media because written text lacks the facial expressions, tones and gestures that are evident in other forms of communication. For this reason, machine learning algorithms, which are well suited to prediction and classification, are used to detect emotions with text as input.
Few works have been reported in the literature on the study and design of new KNN algorithms. The main goal is to reduce the computational cost of the algorithm. Several condensing techniques have been proposed to reduce the computational complexity of the KNN algorithm. Different modified reference list approaches have been suggested, and KNN combined with the K-d tree has been discussed. A sentiment classification algorithm has been proposed for sentiment analysis on social media sites. Besides these approaches, some hybrid algorithms have also been discussed.
From the above literature survey, it is evident that no method for designing a K Nearest Neighbor algorithm dominates the literature. The methods discussed above each have advantages and disadvantages. Given their drawbacks, such as large space requirements, high testing computational cost and high misclassification rate, the main thrust of this study is to reduce the computational cost of the KNN algorithm.
Motivation and contribution
The main motivation for this work comes from the recent growth of interest in emotion analysis, driven by the increase in online communication on social networks, with trust, security and privacy issues kept in consideration [46, 47]. Academic and industry researchers are increasingly attracted by social networks, and many have integrated these sites into their daily practices. To understand the implications, practices, culture and meaning of the sites, researchers from different fields have examined social networks and users’ engagement with them. Emotions are the root cause of many mental conditions, which increase the risk of developing health problems such as heart disease, mental disorders, anxiety disorders, adult attention deficit and bipolar disorder. Detection of emotions therefore has a great effect on human health and quality of life.
With the rapid development of social networks, it has become feasible to detect emotions from tweets, which contain text, visual content, images, shared opinions and other material that may indicate emotion-related symptoms. As mentioned in the background study, no method dominates the literature because of drawbacks such as large space requirements, high computational complexity and high misclassification rate. Overcoming these drawbacks with a new method of reduced complexity is the main motivation for this work. Because millions of users have joined these sites, a further motivation was to derive the dynamic behavior of users from these micro-blogs. Given the large volume of data available on social networks, it is both challenging and interesting to find the behavior and actions that influence connections between two strangers. The huge amount of data available on social networks also offers insight into specific behaviors, for example voting behavior in upcoming elections, product taste, better employee-employer matches, growth of small businesses, creation of new businesses and security-related information exchange.
Quality of life can be measured on the basis of different aspects of life, including social, emotional and psychological aspects, life satisfaction and work. Measuring and tracking quality of life and the living conditions of society helps social scientists in public policy making.
The sentiment of buyers can be gauged, which helps commercial agencies with the promotion of new products and with market research.
Typically, research using online social networks involves collecting a very large amount of information, analyzing it from different vantage points, and proposing and testing models to explain the dynamics of certain behaviors. Social media can influence our own behavior, actions and habits in ways that are hard to imagine. In this research work we believe that social networks have become ideal “laboratories” for social scientists who want to study human behavior.
The contribution of this work is to propose and implement a novel approach, the K-RCC algorithm, based on the K Nearest Neighbor machine learning algorithm. The biggest advantage of the K-nearest-neighbor classifier is that it is very simple and easy to understand. Its downside is that its computational cost and misclassification rate are high. The second motivation for this work derives from this disadvantage, and efforts are made to reduce the computational cost and the misclassification rate of the K-nearest-neighbor classifier. Our main aim in this study is to identify user emotion by comparing four classifier algorithms, namely Naïve Bayes, J48, K Nearest Neighbors (KNN) and Support Vector Machine (SVM), with our proposed algorithm. We work with six basic emotional classes: anger, disgust, fear, joy, sadness and shame, following the basic emotional categories of the International Survey on Emotion Antecedents and Reactions (ISEAR).
Proposed flowchart and algorithm
Figure 1 shows the flow chart of the proposed algorithm and Fig. 2 shows the steps of the proposed algorithm.

Fig. 1. Flow chart of the proposed algorithm.

Fig. 2. Algorithm steps of the proposed algorithm.
The main advantage of KNN is that it is a simple classifier that can be used for both classification and regression. The downside is that it is computationally expensive compared to other methods. Its testing complexity is O(n * d), where ‘n’ is the number of training features and ‘d’ is the dimensionality. The testing complexity of KNN can be addressed either by reducing the dimensionality ‘d’ or by reducing the number of training features. K-d trees and fingerprinting techniques are used to achieve this, but the disadvantage of these approximations is that they can miss the nearest neighbours.
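To make the O(n * d) testing cost concrete, here is a minimal sketch of plain brute-force KNN classification. It is a generic baseline, not the proposed K-RCC method, and the function name `knn_predict` is illustrative. Every query computes one distance per training feature (n distances, each over d dimensions); the sort adds a further O(n log n) term that is not counted in the figure above.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Brute-force KNN: compares the query against every training feature."""
    # n distance computations, each costing d operations.
    dists = [(math.dist(x, query), y) for x, y in zip(train, labels)]
    dists.sort(key=lambda t: t[0])
    # Majority vote among the k closest training features.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

For example, `knn_predict([(0, 0), (1, 0), (5, 5)], ["joy", "joy", "fear"], (0.5, 0.2), 2)` votes over the two nearest points, both labelled "joy".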
This drawback is overcome to a great extent by our new K-RCC algorithm, discussed in the following sections. As shown in Fig. 3, the whole training feature space is divided into four symmetrical regions.

Fig. 3. Two-dimensional training feature space.
Here we consider a two-dimensional space model with dimensions X and Y. Two medians, m1 and m2, are drawn on the X and Y dimensions respectively. Their intersection gives the median intersection point M, which serves as a starting point in the two-dimensional feature space. Once M is obtained, the Euclidean distance between M and the testing feature (TTF) is computed; call it D units. The distance D is divided into four equal parts, giving the candidate radii D/4, D/2, 3D/4 and D. At this stage we know the position of the given TTF, the coordinates of M and the distance between them. The radius ‘r’ is set to D/4 units and a region R1 is selected with the testing feature (TTF) as its centre, as shown in Fig. 3. This is the smallest region, with a radius of r units. The question now is how many training features fall in the region R1. If the number of training features in this region is equal to or greater than the value of K, the classification decision can be made by majority vote. Suppose the value of K is five. If the number of training features in the R1 region is equal to or greater than five, we calculate the Euclidean distances from the testing feature to all the training features in the R1 region, and a label is assigned to the testing feature by majority vote.
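The setup step described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation; the helper names `median_point` and `radius_schedule` are hypothetical, and the medians are taken per dimension over the training features.

```python
import math
from statistics import median

def median_point(train):
    """Median intersection point M: per-dimension medians m1 (X) and m2 (Y)."""
    return (median(p[0] for p in train), median(p[1] for p in train))

def radius_schedule(train, ttf):
    """Return M, the distance D from M to the testing feature,
    and the four candidate radii D/4, D/2, 3D/4 and D."""
    m = median_point(train)
    d = math.dist(m, ttf)
    return m, d, [d / 4, d / 2, 3 * d / 4, d]
```

With three training features whose medians are 6 and 8 and a testing feature at (7, 4), this yields M = (6, 8) and D equal to the square root of 17.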
If instead the number of training features in the R1 region is less than the value of K, we need to increase the number of training features in the region. This can only be done by increasing the value of ‘r’ to r = D/2, forming region R2. The area of region R2, shown in Fig. 3, is larger than that of region R1, and hence the number of training features will increase.
Again, if the number of training features in region R2 is equal to or greater than the value of K, the classification decision can be made; otherwise the radius ‘r’ is incremented by D/4. The radius is now r = 3D/4, forming region R3, whose area is greater than that of region R2, so the number of training features in the region increases. The condition on the number of training features in region R3 is checked against the value of K; if it is met, the classification decision for the testing feature is made and the algorithm stops.
Otherwise the radius is increased again by D/4 units, giving r = D. This gives the region R4, which has the maximum area and accommodates the maximum number of training features.
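The region-growing loop over R1 to R4 can be sketched as below. For clarity this sketch tests membership with an exact distance check against every training feature, which is itself O(n * d); it illustrates only the stopping rule, not the cost saving. The function name and the behaviour when fewer than K features are found even at r = D (return what was found) are assumptions.

```python
import math

def smallest_sufficient_radius(train, ttf, d, k):
    """Grow r through D/4, D/2, 3D/4, D until at least k training
    features lie within distance r of the testing feature."""
    r, inside = 0.0, []
    for step in (1, 2, 3, 4):
        r = step * d / 4
        inside = [p for p in train if math.dist(p, ttf) <= r]
        if len(inside) >= k:
            break
    # Assumption: if R4 still holds fewer than k features, return them as-is.
    return r, inside
```

With D = 4, K = 2 and a testing feature at (7, 4), the training features (7, 5) and (7, 6) are both captured once the radius reaches D/2 = 2.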
Even for medium or small training data sets, the condition on K will be met at this stage, because region R4 has the maximum area and accommodates the maximum number of training features. Once the condition is met, the feature coordinates P1′, P2′, P3′ and P4′ of the region are calculated. These give the maximum and minimum coordinate values of the region. The feature coordinates P1′, P2′, P3′ and P4′ are obtained by adding the known distance r to the known X, Y coordinates (x, y) of the testing feature (TTF) at particular angles: P1′ = (x - r, y) at an angle of 180°, P2′ = (x, y - r) at an angle of 270°, P3′ = (x + r, y) at an angle of 0° and P4′ = (x, y + r) at an angle of 90°. After finding P1′, P2′, P3′ and P4′, the main goal is to find the region of interest (ROI). The ROI contains only those training features that are of interest; the rest are discarded. This is shown in Fig. 4. Which training features are kept and which are discarded can be decided only once the x and y coordinates of the feature points P1, P2, P3 and P4 are known. The feature points P1, P2, P3 and P4 give the maximum and minimum (x, y) coordinates of the region of interest, so training features below the minimum or above the maximum values are discarded, which reduces the computational complexity.
As shown in Fig. 4, the ROI does not include all the training features when calculating the Euclidean distance. The Euclidean distance is calculated only for those training features that fall inside the ROI, thereby reducing the computational complexity.
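A minimal sketch of the axis points and the resulting ROI membership test follows. The names `axis_points` and `in_roi` are illustrative. The point of the bounding-box test is that it needs only coordinate comparisons, so training features can be discarded without computing a Euclidean distance.

```python
def axis_points(ttf, r):
    """Points at distance r from the testing feature along the axes."""
    x, y = ttf
    return {
        "P1'": (x - r, y),  # 180 degrees
        "P2'": (x, y - r),  # 270 degrees
        "P3'": (x + r, y),  # 0 degrees
        "P4'": (x, y + r),  # 90 degrees
    }

def in_roi(p, ttf, r):
    """True if p lies inside the square ROI bounded by the min/max x and y."""
    x, y = ttf
    return x - r <= p[0] <= x + r and y - r <= p[1] <= y + r
```

Note that the square ROI slightly over-covers the circular region of radius r, so it can admit a few features near the corners that lie farther than r from the testing feature.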
Fig. 4. Region of interest.
Without the ROI, the computation would be very expensive, because the distances to all the training features would have to be compared. For a data set of size ‘n’, ‘n’ distance computations are required, and each distance computation takes ‘d’ operations. The testing-time complexity is therefore O(n * d). The feature points P1, P2, P3 and P4 are obtained by interchanging the coordinates of P1′, P2′, P3′ and P4′. P1 is formed from the coordinates of P1′ and P2′: the x coordinate of P1 is the x coordinate of P1′ and the y coordinate of P1 is the y coordinate of P2′.
P2 is formed from the coordinates of P2′ and P3′: the x coordinate of P2 is the x coordinate of P3′ and the y coordinate of P2 is the y coordinate of P2′.
P3 is formed from the coordinates of P3′ and P4′: the x coordinate of P3 is the x coordinate of P3′ and the y coordinate of P3 is the y coordinate of P4′.
P4 is formed from the coordinates of P4′ and P1′: the x coordinate of P4 is the x coordinate of P1′ and the y coordinate of P4 is the y coordinate of P4′.
Our algorithm is advantageous because it reduces the number of training features that have to be compared, so the computational cost is reduced.
Labels used
Fig. 5 above depicts the mathematical model of the proposed K-RCC algorithm. A two-dimensional space with dimensions X and Y is considered, with the input training feature (TF) set x_i and the testing feature (TTF), which is (7, 4) in this case. Two medians m1 and m2 are calculated on the X and Y dimensions to obtain the median intersecting point “M”, which is (6, 8) in this case. If the median intersecting point M is equal to the testing feature (TTF), half of the X, Y coordinates of M are added to M so that new coordinates are obtained. The Euclidean distance “D” is calculated between the testing feature (TTF) and the median intersecting point “M”. The distance D is divided into four equal parts, giving the radii {D/4, D/2, 3D/4, D}. Region R1 is obtained by taking radius r = D/4 units with the TTF as its centre in the training space. Four feature points are then obtained: P1′ at 180°, P2′ at 270°, P3′ at 0° and P4′ at 90°.

Shows Mathematical Model.
The region of interest (ROI) is obtained by interchanging the coordinates of P1′, P2′, P3′ and P4′. We then count the training features inside the ROI. If the number of training features inside the ROI is less than K, r is incremented (r = r + D/4) and a new ROI is found. Otherwise, the Euclidean distance is calculated between the TTF and each TF inside the ROI, and the other TFs are discarded. The final feature label is obtained by majority vote and assigned to the TTF, which is thereby classified.
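The steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `krcc_classify`, the class labels attached to the training features, and the fallback used when even region R4 holds fewer than K features are all hypothetical. The median point M is passed in and is assumed to differ from the TTF.

```python
import math
from collections import Counter

def krcc_classify(train, labels, ttf, k, m):
    """Classify ttf by searching growing square regions around it.

    train  : list of (x, y) training features (TF)
    labels : one class label per training feature (hypothetical here)
    ttf    : testing feature (x, y)
    k      : number of neighbours K
    m      : median intersecting point M, assumed to differ from ttf
    """
    d = math.dist(ttf, m)            # distance D between TTF and M
    x, y = ttf
    roi = []
    for i in range(1, 5):            # regions R1..R4, r = D/4 .. D
        r = i * d / 4
        # Square ROI whose corners P1..P4 follow from P1'..P4'
        roi = [
            (p, lab) for p, lab in zip(train, labels)
            if x - r <= p[0] <= x + r and y - r <= p[1] <= y + r
        ]
        if len(roi) >= k:            # enough candidates: stop growing
            break
    if not roi:
        raise ValueError("no training feature found inside region R4")
    # Assumption: if R4 holds fewer than k features, vote with what is there.
    roi.sort(key=lambda item: math.dist(item[0], ttf))
    votes = Counter(lab for _, lab in roi[:k])
    return votes.most_common(1)[0][0], len(roi)

# Training features of the worked example; the labels are invented
# purely so that the vote can be demonstrated.
train = [(1, 9), (2, 3), (2, 6), (3, 11), (4, 1), (3, 7), (4, 6), (5, 4),
         (6, 8), (5, 7), (7, 2), (8, 8), (7, 9), (9, 6), (10, 3), (11, 4),
         (12, 9)]
labels = ["sadness", "fear", "sadness", "sadness", "fear", "sadness",
          "sadness", "fear", "sadness", "sadness", "joy", "anger", "anger",
          "joy", "joy", "joy", "anger"]

result = krcc_classify(train, labels, ttf=(7, 4), k=5, m=(6, 8))
print(result)
```

In this sketch the second return value is the number of training features for which a distance is actually computed, which is the quantity the complexity argument depends on.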
Assumptions: Let ‘K’ be a positive integer, K = 5. Assume ‘C’ is the number of classes, where C ≥ 2. The input training feature (TF) set is x_i = (1,9), (2,3), (2,6), (3,11), (4,1), (3,7), (4,6), (5,4), (6,8), (5,7), (7,2), (8,8), (7,9), (9,6), (10,3), (11,4), (12,9). Number of training features = 17. Testing feature (TTF) = (7, 4). Assume X and Y are the two dimensions of the space.
As per the KNN algorithm, the first step is to compare the testing feature with all of the training features. But why is this necessary? Why not compare only the training features around the testing feature?
When we look at Fig. 6, the training features (a, b, c) around the testing feature TTF are its nearest neighbours. We do not compute distances to find them. The reason is that humans have a visual cortex, which is very good at tasks such as approximating the positions of objects. We can easily pick out the nearest neighbours in a two-dimensional or three-dimensional space.

Shows what we see?
The machine, on the other hand, sees the same data as two-dimensional coordinate points such as (1,9), (2,3), ..., (12,9). The testing point is (7, 4).
Median calculation
Median on X dimension:
$$\text{median} = \left(\frac{n + 1}{2}\right)^{\text{th}} \text{ term}$$
where ‘n’ is the number of training features = 17.
median = ((17 + 1)/2)th term = (18/2)th term = 9th term, which is (6, 8).
Draw a median line on X dimension called m1.
Similarly draw a median line on Y dimension called m2.
Both the medians m1 and m2 will intersect at point M.
This gives the point M (6, 8).
Check whether the testing feature TTF (7, 4) and the intersecting point M (6, 8) are the same. If they are, shift the median point M by adding half of its coordinate values to it, M = M + M/2, which for M (6, 8) would give the new median intersecting point M (9, 12).
This condition does not hold in this case, because M and TTF have different values.
M = (6, 8), TTF = (7, 4)
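The selection of M can be sketched as below. The sketch follows the worked example literally, which takes the ((n + 1)/2)-th training feature in the order the set is listed; the function name `median_point` is hypothetical, and the behaviour for an even number of features is an assumption of this sketch.

```python
def median_point(train, ttf):
    """Return the median intersecting point M for the listed training set.

    Follows the worked example: M is the ((n + 1) / 2)-th term of the
    training set as listed. If M coincides with the testing feature,
    half of M's coordinates are added to M.
    """
    n = len(train)
    m = train[(n + 1) // 2 - 1]      # 9th term when n = 17 (1-based)
    if m == ttf:
        m = (m[0] + m[0] / 2, m[1] + m[1] / 2)
    return m

train = [(1, 9), (2, 3), (2, 6), (3, 11), (4, 1), (3, 7), (4, 6), (5, 4),
         (6, 8), (5, 7), (7, 2), (8, 8), (7, 9), (9, 6), (10, 3), (11, 4),
         (12, 9)]

print(median_point(train, (7, 4)))   # M differs from TTF, so it is kept
print(median_point(train, (6, 8)))   # M equals TTF, so it is shifted
```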
Euclidean distance formula:
$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{(7 - 6)^2 + (4 - 8)^2} = \sqrt{17} \approx 4.12$$
where x1 = 6, y1 = 8, x2 = 7, y2 = 4.
Divide D into four equal parts.
Assign r = D/4, i.e., r = 1.03.
Region R1: This region is formed by selecting ‘r’ as the radius and the TTF as the centre. Here r = 1.03 units.
After the region is formed, the condition for the K nearest neighbours is checked. If the number of training features in region R1 is less than K, the value of ‘r’ is incremented by 1.03 (r = r + 1.03), so r = 2.06. Region R2 is formed with this radius, and the condition is checked again. If K is still greater than the number of training features in region R2, ‘r’ is incremented by 1.03 again to form region R3 with r = 3D/4. Region R4 is formed when r = D, and the condition is expected to be satisfied by this region at the latest; that is, the number of training features will be more than the value of K.
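The radii of the four regions can be checked numerically with a few lines of Python. The variable names are illustrative only.

```python
import math

M, TTF = (6, 8), (7, 4)

# Euclidean distance D between the median point M and the testing feature
D = math.dist(M, TTF)

# Radii of regions R1..R4: D/4, D/2, 3D/4 and D
radii = [i * D / 4 for i in range(1, 5)]

print(round(D, 2))
print([round(r, 2) for r in radii])
```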
Finding the coordinate points P1′ P2′ P3′ and P4′
Which training features are kept and which are discarded is decided by finding the area bounded by P1′, P2′, P3′ and P4′. The computational complexity of the algorithm is reduced only when the selected training feature set has fewer training features than the original training feature set x_i. The coordinate points P1′, P2′, P3′ and P4′ are obtained by adding and subtracting ‘r’ units to and from the x and y coordinates of the testing feature (TTF) at different angles, as shown below.
We know the x and y coordinates of the testing feature (TTF), which are (x = 7, y = 4), and r = 4.
Point P1′: the x coordinate is ‘r’ units less than the x coordinate of the TTF (angle 180°), and the y coordinate is the same as the y coordinate of the TTF.
Point P2′: the x coordinate is the same as the x coordinate of the TTF, and the y coordinate is ‘r’ units less than the y coordinate of the TTF (angle 270°).
Point P3′: the x coordinate is ‘r’ units more than the x coordinate of the TTF (angle 0°), and the y coordinate is the same as the y coordinate of the TTF.
Point P4′: the x coordinate is the same as the x coordinate of the TTF, and the y coordinate is ‘r’ units more than the y coordinate of the TTF (angle 90°).
Thus, the coordinate points are obtained as P1′ = (3, 4), P2′ = (7, 0), P3′ = (11, 4) and P4′ = (7, 8). After obtaining these coordinate points, the next step is to find the region of interest (ROI).
The region of interest (ROI) consists of the feature points P1, P2, P3 and P4. These are obtained by interchanging the coordinates of P1′, P2′, P3′ and P4′. P1 is formed from the coordinates of P1′ and P2′: the x coordinate of P1 is the x coordinate of P1′ and the y coordinate of P1 is the y coordinate of P2′.
P2 is formed from the coordinates of P2′ and P3′: the x coordinate of P2 is the x coordinate of P3′ and the y coordinate of P2 is the y coordinate of P2′.
P3 is formed from the coordinates of P3′ and P4′: the x coordinate of P3 is the x coordinate of P3′ and the y coordinate of P3 is the y coordinate of P4′.
P4 is formed from the coordinates of P4′ and P1′: the x coordinate of P4 is the x coordinate of P1′ and the y coordinate of P4 is the y coordinate of P4′.
Thus, the coordinate points are obtained as P1 = (3, 0), P2 = (11, 0), P3 = (11, 8) and P4 = (3, 8). This is the ROI for the Euclidean distance measurement used in the classification. Only those training features (TF) that fall inside the ROI (P1, P2, P3, P4) are retained for the Euclidean distance measurement. Training features whose x coordinates lie between 3 and 11 and whose y coordinates lie between 0 and 8 are selected. The rest of the training features are discarded from the TF set. Therefore, training features such as (1, 9), (2, 3), (2, 6), ... are omitted from the TF set. Before the Euclidean distance between the training features and the testing feature is calculated, the training features inside the ROI (P1, P2, P3, P4) are counted.
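The construction of P1′ to P4′ and of the ROI corners P1 to P4 can be sketched as follows. The helper names `roi_corners` and `in_roi` are hypothetical, and treating points exactly on the boundary as inside the ROI is an assumption of this sketch, since the text does not state how boundary points are handled.

```python
def roi_corners(ttf, r):
    """Return (P1', P2', P3', P4') and (P1, P2, P3, P4) for a given r."""
    x, y = ttf
    p1p = (x - r, y)      # 180 degrees
    p2p = (x, y - r)      # 270 degrees
    p3p = (x + r, y)      # 0 degrees
    p4p = (x, y + r)      # 90 degrees
    # Interchange coordinates to obtain the corners of the ROI
    p1 = (p1p[0], p2p[1])
    p2 = (p3p[0], p2p[1])
    p3 = (p3p[0], p4p[1])
    p4 = (p1p[0], p4p[1])
    return (p1p, p2p, p3p, p4p), (p1, p2, p3, p4)

def in_roi(point, corners):
    """True if point lies inside the rectangle P1..P4 (bounds inclusive)."""
    p1, _, p3, _ = corners
    return p1[0] <= point[0] <= p3[0] and p1[1] <= point[1] <= p3[1]

primes, corners = roi_corners((7, 4), 4)
print(primes)
print(corners)
print(in_roi((5, 4), corners), in_roi((1, 9), corners))
```

Only the membership test is needed at classification time, so the comparison per training feature reduces to four inequality checks before any distance is computed.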
As per the assumption K = 5, the ROI should contain more than five training features. This holds in this case: the ROI contains ten training feature points, which is greater than the value of K.
The training feature set which will be omitted is:
{(1, 9) , (2, 3) , (2, 6) , (3, 11) , (4, 1) , (3, 7) , (12, 9)} .
The training feature set which will be retained for the Euclidean distance measurement is:
{(4, 6) , (5, 4) , (6, 8) , (5, 7) , (7, 2) , (8, 8) , (7, 9) , (9, 6) , (10, 3) , (11, 4)}
Euclidean distance measurement
Complexity of KNN algorithm
Training Complexity:
What the K nearest neighbour algorithm actually does is compare the training features one by one. If there are ‘n’ training instances it has to make ‘n’ comparisons, and each comparison takes ‘d’ operations. These comparisons, however, are made at testing time; the training complexity of the algorithm is O(d), which is very low.
Testing Complexity:
For the testing complexity, if ‘n’ is the number of training instances and ‘d’ is the dimensionality of the training space, then the testing complexity is O(n * d), because every training instance is compared with the testing instance. This is very high.
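For reference, a plain brute-force KNN classifier makes the O(n * d) testing cost visible: it computes one distance per training instance, and each distance touches all d coordinates. The sketch below is a generic illustration with a hypothetical function name and toy data, not the Weka implementation used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Brute-force KNN: returns (predicted label, distances computed)."""
    distances = []
    for point, label in zip(train, labels):          # n iterations
        # each distance computation costs d operations
        distances.append((math.dist(point, query), label))
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0], len(distances)

# Toy data, invented for illustration
toy_train = [(0, 0), (1, 0), (5, 5), (6, 5)]
toy_labels = ["a", "a", "b", "b"]
prediction, n_distances = knn_predict(toy_train, toy_labels, (0.5, 0), k=3)
print(prediction, n_distances)
```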
Complexity of our algorithm
Training Complexity:
The training complexity will be the same as of KNN algorithm.
Testing Complexity:
The main challenge for our algorithm is to reduce the testing complexity. This can be achieved either by reducing the dimensionality ‘d’ or by reducing the number of training features. Reducing the number of training features is possible only by finding the potential training features that are the nearest neighbours of the testing feature. In our approach a region of the space is selected instead of the whole training space. Let R be the whole training space, as shown in the figure of the system model; then the region-wise testing complexity can be calculated as follows:
Case 1.
Region under consideration: R1, where R1 ≪ R.
Space under region R1: r = D/4.
The testing complexity is O(d′ * n′), where d′ is the dimensionality of the training space and n′ is the number of training features in region R1.
If the condition n′ > K is true, the complexity of the algorithm is at its minimum in this case.
Case 2.
Region under consideration: R2, where R2 ≪ R.
Space under region R2: r = D/2.
‘n’ is the total number of training features in space R.
The testing complexity is O(d′ * n′), where d′ is the dimensionality of the training space and n′ is the number of training features in region R2.
n′ ∈ n and n′ ≪ n.
If the condition n′ > K is true, the complexity of the algorithm is higher than in case 1 but still lower than that of KNN, because R2 ≪ R.
Case 3.
Region under consideration: R3, where R3 ≪ R.
Space under region R3: r = 3D/4.
‘n’ is the total number of training features in space R.
The testing complexity is O(d′ * n′), where d′ is the dimensionality of the training space and n′ is the number of training features in region R3.
n′ ∈ n and n′ ≪ n.
If the condition n′ > K is true, the complexity of the algorithm is higher than in case 2 but still lower than that of KNN, because R3 ≪ R.
Case 4.
Region under consideration: R4, where R4 ≪ R.
Space under region R4: r = D.
‘n’ is the total number of training features in space R.
The testing complexity is O(d′ * n′), where d′ is the dimensionality of the training space and n′ is the number of training features in region R4.
n′ ∈ n and n′ ≪ n.
If the condition n′ > K is true, the complexity of the algorithm is higher than in case 3 but still lower than that of KNN, because R4 ≪ R. This is the maximum complexity of our algorithm, which is still much lower than that of the KNN algorithm.
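The four cases can be illustrated on the 17 training features of the worked example by counting n′ for each region and comparing n′ * d with n * d. In this sketch each region is taken as the square ROI of half-width r around the TTF with inclusive bounds, and r takes the unrounded values D/4 to D; both choices are assumptions, so the counts are indicative rather than a reproduction of the reported figures.

```python
import math

train = [(1, 9), (2, 3), (2, 6), (3, 11), (4, 1), (3, 7), (4, 6), (5, 4),
         (6, 8), (5, 7), (7, 2), (8, 8), (7, 9), (9, 6), (10, 3), (11, 4),
         (12, 9)]
ttf, m = (7, 4), (6, 8)
d_dim = 2                       # dimensionality d of the training space
n = len(train)                  # total number of training features
D = math.dist(ttf, m)

def count_in_roi(r):
    """Number of training features n' inside the square ROI of half-width r."""
    x, y = ttf
    return sum(1 for px, py in train
               if x - r <= px <= x + r and y - r <= py <= y + r)

counts = [count_in_roi(i * D / 4) for i in range(1, 5)]
for i, n_prime in enumerate(counts, start=1):
    print(f"R{i}: n' = {n_prime}, n' * d = {n_prime * d_dim}, n * d = {n * d_dim}")
```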
Region wise computational complexity
Symbols used and meanings
Data sets
We worked with two databases. The first was collected from the Twitter social network through Twitter application development. For streaming the live tweets, an Application Program Interface (API) was developed using the Natural Language Toolkit (NLTK) with Python. Twitter was selected for the data set because of its popularity and its richness in opinion and emotion content. Social media is an immense contributor to the research field, and social sites have huge numbers of members with rich experiences and credentials. The database has approximately one lakh (1,00,000) tweets, with user ID, Retweet_count, Tweet_Time, Tweet_Location, etc. as parameters. In our experiment we used approximately 10,000 tweets for emotion analysis. The approach applied in this study draws its strength and robustness from using different data sets for training and testing the machine learning algorithms.
The second database is the International Survey on Emotion Antecedents and Reactions (ISEAR) dataset for text pattern and emotion analysis (Scherer and Wallbott 1994). This data was collected by a large group of psychologists directed by Klaus R. Scherer and Harald Wallbott (Salton et al. 1975). To build the database, the experiences of a large number of people were collected, in which situations were reported that cover all seven major emotions: joy, fear, anger, sadness, disgust, shame and guilt. The simulation experiments are conducted using the ISEAR data set.
Sample tweets and emotion sentences
The table shows sample examples of the refined tweets and of emotion sentences from the ISEAR dataset.
Standard datasets were selected on which the classification algorithms were implemented to analyze emotions from text. For this purpose, we used two data sets, namely the International Survey on Emotion Antecedents and Reactions (ISEAR) dataset and the live tweets data set from the Twitter social network.
After data collection, several text pre-processing steps are followed: data pre-processing, tokenization, stemming, stop-word elimination and feature selection. The text classification process is carried out using Weka’s unsupervised attribute filter called StringToWordVector, which converts string attributes into a set of attributes representing word occurrence information from the text contained in the strings. In data pre-processing, the text is converted into a clean word format representing a large number of features. Tokenization follows: its goal is to explore the words in a sentence, and the text is broken into words, symbols, phrases or other meaningful elements called tokens. In our experiment we used Weka’s core word tokenizer for tokenization. The next step is stemming, which reduces words to their stem, root or base form. In English many words can be reduced to their stem; e.g., drive, driving and drives are reduced to the root word drive. Moreover, nouns can be reduced to their stem by removing a trailing ‘s’. In this experiment Weka’s core stemmer called the Iterated Lovins Stemmer is used. It is an iterated version of the Lovins stemmer that stems a word until it no longer changes. The next step is stop-word elimination. Words like ‘a’, ‘is’, ‘you’ and ‘an’ (prepositions, articles and pronouns) can be considered stop-words, and eliminating them helps to improve system performance. Weka’s core option for reading stop-words from a file is used as the stop-word eliminator in this experiment. The last step in text classification is feature selection (FS), which is important for improving the scalability, efficiency and accuracy of a text classifier.
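The tokenization, stop-word elimination and stemming steps can be illustrated with a small pure-Python sketch. It does not reproduce Weka's components: the stop-word set is a tiny illustrative subset, and `crude_stem` is a deliberately naive suffix stripper standing in for the Iterated Lovins Stemmer, so its stems will differ from Weka's. All names are hypothetical.

```python
import re

# Illustrative subset only; the experiment reads its stop-words from a file.
STOP_WORDS = {"a", "an", "is", "you", "the", "to", "of"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Naive suffix stripping; NOT the Lovins algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop-words, then stem the remaining tokens."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]

tokens = preprocess("She is driving a car")
print(tokens)
```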
It constructs a vector space model of the features that are relevant for classification and removes the features that are considered irrelevant. This transformation has a number of advantages, including a smaller dataset size, lower computational requirements for the text categorization algorithms (especially those that do not scale well with the size of the feature set) and a considerably smaller search space. Reducing the tendency to over-fit is another advantage of feature selection. In the FS process the words with the highest scores are kept. A word’s score is determined by a predetermined measure of the importance of the word in the text sentence. The scoring method used in this experiment is the simple TF-IDF ranking technique, an easy and efficient method for text mining and information retrieval. TF stands for Term Frequency and IDF for Inverse Document Frequency. This statistical method evaluates the importance of a word in a document or corpus: the importance increases in proportion to the number of times the word appears in the document but is offset by the frequency of the word in the corpus.
In this experiment we use Weka’s unsupervised attribute filter called StringToWordVector with its TF-IDF properties set to true.
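The idea behind TF-IDF weighting can be sketched as below. The sketch uses the raw term count multiplied by log(N / df), which is one common variant and is an assumption here; Weka's filter may apply a different transform or normalization. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)          # term frequency within this document
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["joy", "happy", "joy"], ["sad", "happy"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "happy" in this toy corpus, receives weight zero, which is the offsetting effect described above.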
The Weka tool is used for implementation and result analysis. WEKA 3.7.12 is an open source tool that is used here for the classification. Weka (Waikato Environment for Knowledge Analysis) is a popular software suite for machine learning. It is written in Java and developed at the University of Waikato, New Zealand. Weka contains a large collection of tools and interfaces for easy access to its functionality, including algorithms for data analysis and predictive modeling and visualization tools with graphical user interfaces. Weka supports several standard data mining tasks, specifically clustering, data pre-processing, regression, classification, visualization and feature selection. The data made available to Weka is assumed to be a single flat file or relation in which each data point is described by a fixed number of attributes (normally numeric or nominal attributes, although some other attribute types are also supported). All the techniques are predicated on this assumption. Weka accepts datasets in the attribute-relation file format (.arff). It can also accept datasets in comma-separated value (CSV) format, which can be converted into ARFF format.
The knowledge flow shown in Fig. 7 is Weka’s interface for visualizing a data flow. It can handle data either in batches or incrementally. The components of the knowledge flow are data sources, data sinks, filters, classifiers, clusterers, evaluators and visualization. The steps to build the knowledge flow for this study are summarised below.
Knowledge flow diagram. Select an ARFF loader from the Data Sources tab and configure it. Select the ClassAssigner object from the Evaluation tab to specify the class attribute. Right-click on the data source to connect it to the ClassAssigner. To configure the ClassAssigner, right-click on it and select the configure option. Select the CrossValidationFoldMaker from the Evaluation tab and connect it to the output of the ClassAssigner. Select all the classifiers from the Classifiers tab and connect each of them to the CrossValidationFoldMaker, first by training set and then by test set. Connect all the classifiers to the ClassifierPerformanceEvaluator by selecting batchClassifier. Finally, select the TextViewer and the ModelPerformanceChart from the Visualization toolbar, and connect the evaluator output of all the classifiers to them through text and thresholdData, respectively.
After defining the problem and preparing the data, we apply machine learning algorithms to the data in order to evaluate and compare the results. We select the data for training and testing the machine learning algorithms and assess their performance; this setup is called the test harness. The goal of the test harness is to test the algorithms quickly and consistently against a fair representation of the problem being solved. In this experiment we use two data sets for training and testing the machine learning algorithms: the ISEAR dataset (International Survey on Emotion Antecedents and Reactions) is used for training and the Twitter data set is used for testing. The performance analysis of the K-RCC algorithm is summarized in Table 6. The performance is compared with basic methods such as the K Nearest Neighbors (KNN), J48, Naïve Bayes and Support Vector Machine (SVM) algorithms, and with newer methods such as deep learning, neural networks with fuzzy logic, and knowledge-based ANN. Cy Yam 2015 [48], in emotion detection using deep learning, reports an improved weighted accuracy of 60.60%, but the main challenge there is balancing the distribution of samples for each class, which affects the overall performance of the emotion classifier. In Shadi Shaheen et al. 2014 [49], on emotion recognition based on automatically generated rules, the emotion representation is based on syntactic and semantic structures and is generalized by various ontologies, which again brings the drawback of relying on knowledge bases. The EmoTxt toolkit presented by Fabio Calefato et al. [50] uses SVM to detect emotions from input text and to train a custom emotion classifier from scratch, but it uses manually annotated data. The neural network and fuzzy logic model presented by Neeraj Kanger and Gourav Bathla, 2017 [51] improves the classification accuracy up to 90%, but the model is based on only three classes of emotions.
The simulation results report
To evaluate the performance of our proposed K-RCC algorithm against other state-of-the-art machine learning algorithms used for detecting emotions on social media, the live streaming tweets data set is used as the test set. Our experiment focused primarily on the computational complexity of the algorithms used. We report the results in terms of Precision, Recall, TP rate, FP rate, F-measure and ROC area. The performance of the proposed K-RCC algorithm relative to existing work is summarized in Table 8.
Class wise accuracy report
Result comparison with existing work
The reasons that distinguish our work from other existing work are as follows.
Various existing methods proposed in the literature are lexicon based. They are restricted by their lexicons and, more particularly, they use static prior emotion or sentiment values of terms regardless of their contexts. Some algorithms have been proposed to update the emotion strength assigned to the terms in the lexicon, but they need to be trained on manually annotated corpora. The second reason is that lexicon-based methods depend fully on words or syntactical features that explicitly reflect emotions.
The third reason is that their degree of flexibility is limited, as in many cases the emotion of a word is implicitly associated with the semantics of its context. Better performance is achieved by machine learning methods. Their main advantage is that no lexicon or corpus is needed: they are highly flexible within a particular domain, and training data can be labelled automatically, which matters especially where the data is continuously changing. The KNN algorithm is advantageous because of its simplicity and versatility, as it is used for both regression and classification. Its main downside is its computational complexity: its testing complexity is O(n * d), where ‘n’ is the number of training features and ‘d’ is the dimensionality. The testing complexity of KNN can be addressed either by reducing the dimensionality ‘d’ or by reducing the number of training features. K-d trees and fingerprinting techniques are used to achieve this, but the disadvantage of these approximations is that they can miss the nearest neighbours.
This drawback is overcome to a great extent in our new K-RCC approach, in which the training features that lie outside the ROI are discarded, hence reducing the complexity.
A ten-fold cross-validation approach is used for training the model. The model is trained on all folds except one, which is left out for testing the algorithm. This process is repeated so that each fold gets an opportunity to be left out and act as the test data set.
Lastly, the performance measures are averaged across all folds to estimate the capability of the algorithm on the test data. In this experiment six emotional categories are classified: sadness, joy, fear, anger, shame and disgust. The standard performance analyses used for machine learning algorithms are cost-insensitive analysis, cost-sensitive analysis and statistical analysis.
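The fold rotation described above can be sketched as follows. This is a plain, unshuffled and unstratified split written for illustration; the function name is hypothetical, and the cross-validation used in the experiments is the one provided by Weka, which may assign instances to folds differently.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Fold i holds the samples i, i + k, i + 2k, ...
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold_no, fold in enumerate(folds)
                     if fold_no != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(25, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), test_idx)
```

A performance measure would be computed on each `test_idx` after fitting on the matching `train_idx`, and the ten values averaged.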
Some standard terms used under this analysis are:
TP (True positive) Rate: Measures the proportion of positives that are correctly identified. The TP rates of the algorithms used in this study are shown in Tables 7 to 11. The weighted average true positive rates for the K Nearest Neighbors (KNN), J48, Naïve Bayes and Support Vector Machine (SVM) algorithms are 0.879, 0.896, 0.919 and 0.929, respectively. Compared with these, the true positive rate of the proposed algorithm is higher at 0.952, which indicates better performance.
FP (False positive) Rate: It is the error in data reporting in which a test result wrongly indicates the presence of a condition that is in reality absent. The weighted average false positive rates for the K Nearest Neighbors (KNN), J48, Naïve Bayes and Support Vector Machine (SVM) algorithms are 0.027, 0.021, 0.008 and 0.007, respectively. Compared with these, the false positive rate of the proposed algorithm is lower at 0.006, which indicates a lower error rate than the other algorithms and hence better performance.
Precision: It is the fraction of the documents retrieved that are relevant to the user’s information need.
The precision value of the proposed algorithm is better than that of the other algorithms in comparison, as shown in Table 8.
The formula for calculating the precision is given below.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall: It is the fraction of the documents relevant to the query that are successfully retrieved. The recall value of the proposed algorithm is better than that of the other algorithms in comparison, as shown in Table 8. The formula for calculating the recall is given below.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Here TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively.
F-measure: The F-measure is defined as the harmonic mean of precision and recall. It is also known as the F1 measure because precision and recall are evenly weighted.
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
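The three measures can be computed from the counts of true positives (TP), false positives (FP) and false negatives (FN) using their standard definitions. The function name and the example counts below are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1

# Invented counts for illustration
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f)
```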
Precision, recall and F-measure are not considered cost-sensitive means of analysis. These measures do not give the real performance of a classifier when the data is skewed. If our dataset is highly skewed, for example with 90% positive and 10% negative instances, the classifier results will be skewed towards the positive instances. A classifier that cannot tell positive from negative instances can blindly classify everything as positive and still reach an accuracy as high as 90%. That is not the true accuracy of the classifier; the classifier is badly designed, and precision, recall and F-measure do not present an accurate picture in this case. The ROC (Receiver Operating Characteristic) graph is a visualization tool with which we can tell, in a cost-sensitive manner, whether our classifier is accurate or not. The ROC curve illustrates graphically the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
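How the curve is built by sweeping the threshold can be sketched for the binary case as follows. The function names, scores and labels are hypothetical; the sketch assumes both classes are present and uses the trapezoidal rule for the area under the curve. The multi-class, weighted ROC area reported by Weka is not reproduced here.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the threshold.

    labels: 1 = positive, 0 = negative; both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept every instance that shares the current score
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
print(perfect, mixed)
```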
In the ROC graph the main goal is for the curve to rise towards the upper-left corner, so that the area under it approaches 1 (one). As shown in Fig. 8, the value for the proposed algorithm is closer to 1 than that of the other algorithms. The AUC (Area Under Curve) is 95.6% for the proposed algorithm, higher than for all the other algorithms in comparison.
Shows Comparative ROC of all Algorithm.
The comparative ROC of all the algorithms is shown in Fig. 8. The K-RCC algorithm has a value of 0.956, which is closer to 1 (one) than the values of the other algorithms in comparison.
Statistical analysis
Kappa is defined as the agreement normalized for chance agreement:
$$K = \frac{P(a) - P(e)}{1 - P(e)}$$
where P(a) is the percentage agreement and P(e) is the chance agreement.
The agreement between the classifier and the ground truth is perfect if K = 1.
The agreement is at chance level if K = 0.
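The kappa statistic can be computed from a confusion matrix as in the sketch below, where P(a) is the observed agreement on the diagonal and P(e) is the agreement expected from the row and column totals. The function name and the matrices are hypothetical examples.

```python
def cohen_kappa(confusion):
    """confusion[i][j] = instances of true class i predicted as class j."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(a): observed agreement between classifier and ground truth
    p_a = sum(confusion[i][i] for i in range(size)) / total
    # P(e): chance agreement from the marginal totals
    p_e = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(size)
    )
    return (p_a - p_e) / (1 - p_e)

print(cohen_kappa([[5, 0], [0, 5]]))        # perfect agreement
print(cohen_kappa([[25, 25], [25, 25]]))    # chance-level agreement
```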
The mean absolute error (MAE) is a statistical quantity that measures how close predictions are to the eventual outcomes over a large set of test data. It is given by
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |f_i - y_i| = \frac{1}{n}\sum_{i=1}^{n} e_i$$
The mean absolute error is the average of the absolute errors e_i = |f_i - y_i|, where f_i is the prediction and y_i is the true value.
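The root mean squared error (RMSE) measures how much error there is between two datasets. In other words, it compares a predicted value with an observed or known value. The RMSE is evaluated by the equation:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (P_i - O_i)^2}$$
where P_i is the predicted value and O_i is the observed value.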
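Both error measures follow directly from their standard definitions, as the short sketch below shows. The function names and the example values are hypothetical.

```python
import math

def mean_absolute_error(predicted, true):
    """Average of the absolute errors |f_i - y_i|."""
    return sum(abs(f - y) for f, y in zip(predicted, true)) / len(true)

def root_mean_squared_error(predicted, observed):
    """Square root of the mean squared difference (P_i - O_i)^2."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))

# Invented values for illustration
mae_value = mean_absolute_error([1, 2, 3], [1, 2, 5])
rmse_value = root_mean_squared_error([1, 2, 3], [1, 2, 5])
print(mae_value, rmse_value)
```

Because the errors are squared before averaging, RMSE penalizes the single large error in this example more heavily than MAE does.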
The incorrect classification rate of all the algorithms is shown in Fig. 9. Compared with the other algorithms, K-RCC has the lowest incorrect classification rate at 5.00%. The highest incorrect classification rate is that of the KNN algorithm at 10.58%, followed by J48 at 9.99%, Naïve Bayes at 7.99% and the support vector machine at 6.17%. This shows that our proposed K-RCC algorithm performs better than the other algorithms in comparison.
Shows Incorrect Classification rate of all classifiers.
In this paper a novel K-RCC algorithm is presented, which is based on the K Nearest Neighbor (KNN) algorithm. The main drawback of KNN is its computational complexity, and this drawback is addressed by the K-RCC algorithm. The K-d Tree algorithm is also used for this purpose, but with K-d trees the problem of misclassification arises, which is also reduced in our K-RCC algorithm. K-RCC reduces the computational complexity for the testing features by 58.24% when K = 5 and by 41.18% when K = 9, as compared to the KNN and K-d Tree algorithms. The region-wise computational complexity is calculated. The misclassification rate is reduced from 10.58% for KNN to 5.00% for K-RCC. Two data sets are used, one for training and the other for testing. The training model is built using the ISEAR database. The testing is done on the Twitter dataset collected from the Twitter social site. The four machine learning algorithms used for text classification to identify the emotion classes are Naïve Bayes, J48, K Nearest Neighbors (KNN) and Support Vector Machine (SVM). Our experiments show that K-RCC performs better than the other machine learning algorithms, with a correct classification rate of 94.9956%.
WEKA 3.8, a collection of machine learning algorithms for data mining tasks, is used for the simulation. The result analysis is done by three methods: cost-insensitive, cost-sensitive and statistical analysis.
