Abstract
Social media is a vibrant space where millions of individuals interact and share their views. Processing social media text in Indian languages is challenging because these languages are morphologically rich. Once such unstructured text has been converted into a consistent format, features are extracted from it. In a large corpus, information units, i.e. entities, hold the basic idea of the content. The main aim of the system is to recognise and extract the named entities in Twitter text. The proposed system relies on co-occurrence based word embedding models to extract features for the words in the dataset. The proposed work uses Tamil-language text data from Twitter. To enhance the performance of the system, tri-gram features are extracted from the word embedding vectors. The systems are therefore trained using N-gram embedding features and named entity tags. The system is implemented using a machine learning classifier, Support Vector Machine (SVM). Comparing the proposed systems shows that glove embedding gives better results, with an accuracy of 96.93%, whereas the accuracy of word2vec embedding is 84.53%. The higher accuracy of the glove-based system may be due to the role that the co-occurrence information of glove embedding plays in recognising the entities.
Introduction
Social media has emerged as one of the most significant sources of massive data. The content available on social media platforms is considered to be information. The data created through these platforms, such as views, likes, shares, follows, tweets and comments, also contributes to the information and can produce meaningful units of data. Processing social media data requires additional effort since the data is noisy and unstructured in nature. An example tweet from the dataset is shown in Fig. 1.

Example tweet for Tamil Social Media Text in Twitter.
On analysing such a large volume of data, we can find trivial information. This paper is based on a primary application in natural language processing called Entity Extraction. When text data is processed, the notable key elements in the content are the entities. These entities are grouped into pre-defined categories such as Location, Entertainment, Person, Count, Year, Month and a few more. Entity types that act as rigid designators prevail as the predefined types of entities [10]. A sample of words with their respective entity tags is shown in Fig. 2.

Sample words from dataset representing named entities.
Within a text, a named entity identifies itself unambiguously among items with similar attributes [20]. The dataset used in this work is Tamil social media text extracted from the Twitter platform. The need for pre-processing raw data is high in the case of social media text, because users tend to share their content in an unorganised text format. Hence, there will be symbols, smileys and spelling variations. Once such uninformative content has been eliminated through several pre-processing steps, the Twitter data is ready for feature extraction. This paper implements entity extraction using word embedding methodology to uncover the named entities present in the data. Specifically, it uses embedding vectors from word2vec and glove. The system is implemented and evaluated using a machine learning classifier, Support Vector Machine (SVM) [4].
The contributions of this work include (i) extraction of Tamil social media text from the Twitter platform, (ii) manual tagging of unlabeled corpora, (iii) experimentation with tri-gram embedding in the word2vec and glove models, and (iv) experimentation with the joint embedding concept using word2vec and glove vectors. The effectiveness of word embedding in the field of text processing is discussed in this paper under several sections. Research works related to entity extraction and word embedding methodologies carried out on datasets in different languages are discussed in Section 2. Definitions and mathematical descriptions of word2vec and glove are given in Section 3. Section 4 gives an overview of the methodology used to implement the work described in this paper. Experimental procedures and their corresponding results are included in Section 5. Conclusions and future work are given in Section 6.
Word embedding methodology is emerging as the most promising feature extraction method and outperforms several traditional features. The fundamental idea behind word vector generation is given in the detailed description of the architecture of the wang2vec model [22]. An entity extraction system was implemented using the structured skip-gram model for Malayalam social media text [17]. Glove embedding retrieves the vector representation of a word by performing dimensionality reduction on the co-occurrence counts matrix [15]. Glove embedding was used for building a sentence classification system using a Convolutional Neural Network (CNN) [23]. A neural network approach to parts of speech (POS) tagging utilises word2vec and glove embedding [21]. Named entity recognition (NER) and POS tagging were implemented using Bi-directional LSTM (Bi-LSTM) and CNN [9]. A language independent approach was used to develop a named entity recognition system for Indian languages [3]. Event extraction was performed based on the named entities extracted from Twitter data [19]. A K-Nearest Neighbors (KNN) classifier with Conditional Random Fields (CRF) was utilised for performing entity recognition in Twitter data [8]. The entity extraction task organised by FIRE 2015 was implemented using a machine learning classifier with stylometric features [1]. Named entities in the Bengali language were recognised with the help of features such as prefix, suffix, word context, POS information and Gazetteer features [2]. A hybrid approach in combination with a statistical method was proposed for named entity recognition [5]. A statistical Hidden Markov Model (HMM) was used for the Entity extraction of Social Media text in Indian Languages (ESM-IL) task in FIRE2015 [7]. A CRF based NER model was developed using POS, Chunk, Prefix and Suffix features for English, Hindi and Tamil [11].
A shared task was organised on sentiment analysis of Twitter data in Indian languages such as Bengali, Hindi and Tamil [12]. Extraction of entities in the form of opinion targets was implemented based on bootstrapping [24]. Named entity extraction was implemented using an active learning strategy that aims to minimize the annotation cost while using unlabelled data [25]. An experimental study was done on POS tagging, chunking and named entity recognition for tweets [18]. An overview of the methodologies used by the systems submitted to the shared task on entity extraction in Indian languages is given in [13]. Besides social media text in Unicode format, code-mixed text is prevalent on social media platforms. It combines roman script with the script of the user's native language. The added complexity opens up scope for research. A Code-Mixed Entity Extraction (CMEE) system was implemented using a Twitter dataset in Tamil-English and Hindi-English [16]. Extraction of entities was performed on code-mixed tweets in Indian languages for the CMEE-IL shared task at FIRE2016 [14].
Proposed system
Word2vec and glove embedding
The proposed system is built using word embedding based models, namely word2vec and glove. This paper includes three experiments. The first system performs entity extraction with the features extracted from the word2vec model. The second system makes use of embedding vectors acquired from the glove model. The third system is based on the feature set retrieved from the combination of the word2vec and glove vectors. The basic ideas behind the word2vec and glove embedding models are described in this section. Glove embedding is widely known as a count-based model. It retrieves the vector representation of each word from the frequency of co-occurrences of context words. With the training data as input, the primary task of the glove embedding model is the construction of the co-occurrence matrix. This is followed by factorization of the matrix, which yields a lower dimensional representation for each word in the training data. Therefore, in glove embedding the co-occurrence counts are mapped to word-context co-occurrences, from which the vector representations of words are obtained. The cost function used in this model is given in Equation 2 [15].
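The two glove steps described above, counting co-occurrences and scoring a set of vectors against those counts, can be sketched as follows. This is a minimal illustration and not the implementation used in this work: it assumes that [15] follows the standard glove formulation, a weighted least-squares cost J = sum over pairs of f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2. The function names, the window size and the default values x_max = 100 and alpha = 0.75 are assumptions made for the sketch, and the counts are plain counts without distance weighting.

```python
import math
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within `window` tokens."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): damps rare pairs and caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_cost(counts, w, w_ctx, b, b_ctx):
    """Weighted least-squares cost summed over all non-zero co-occurrence counts."""
    total = 0.0
    for (wi, wj), x in counts.items():
        dot = sum(a * c for a, c in zip(w[wi], w_ctx[wj]))
        err = dot + b[wi] + b_ctx[wj] - math.log(x)
        total += weight(x) * err * err
    return total


# Toy usage: one three-token "tweet" and one-dimensional zero vectors.
counts = cooccurrence_counts([["a", "b", "c"]], window=1)
vocab = ["a", "b", "c"]
zeros = {t: [0.0] for t in vocab}
bias = {t: 0.0 for t in vocab}
print(glove_cost(counts, zeros, zeros, bias, bias))
```

Training then amounts to adjusting the word vectors, context vectors and biases so that this cost decreases; the optimiser is left out of the sketch.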
Word2vec learns the geometric encoding of words using context information. The main aim is to improve the predictive ability of centre words when the context words are given.
An illustrative diagram is given in Fig. 3, where x0 stands for the centre word and x-2, x-1, x1, x2 refer to the context words of x0. The proposed work uses the structured skip-gram model in word2vec to retrieve the vector representation of words in the training data. The probability function of the softmax classifier is given in Equation 2 [22].
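As a rough illustration of the softmax probability mentioned above, the sketch below scores every vocabulary word against the centre word vector and normalises the scores. It shows the plain skip-gram softmax with a single set of output vectors; as we understand the structured variant in [22], one such set is kept per context position, which is not modelled here. The function and variable names are hypothetical.

```python
import math


def softmax_probability(center_vec, output_vecs, target):
    """P(target | centre) = exp(u_target . v_centre) / sum over w of exp(u_w . v_centre)."""
    scores = {
        word: sum(a * b for a, b in zip(u, center_vec))
        for word, u in output_vecs.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    return exps[target] / sum(exps.values())


# Toy usage with two-dimensional vectors for a three-word vocabulary.
output_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
center = [1.0, 0.0]
print(softmax_probability(center, output_vecs, "a"))
```

The probabilities over the whole vocabulary sum to one, so raising the score of the observed context word necessarily lowers the probability of the others.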

Architectural illustration of word2vec model.
The proposed system uses word2vec and glove to extract the word embedding features. The resulting features are subjected to N-gram embedding (here, N = 3). Each tokenized word is then paired with its tag and its n-gram embedding features. This serves as the data for training the systems based on word2vec and glove embedding vectors using the machine learning classifier, SVM. The system workflow is described through the schematic sketch in Fig. 4.

Schematic diagram of methodology.
The dataset extracted from the Twitter platform is subjected to pre-processing steps such as tokenization, hyperlink removal and BIO-format conversion. The dataset includes several meaningless symbols and tokens. Without proper pre-processing, these create ambiguity at the time of feature extraction. BIO format conversion is a traditional method in NLP for named entity recognition (NER) tasks.
Named entity tags such as PERSON and LOCATION are converted into B-PERSON and I-PERSON, and B-LOCATION and I-LOCATION, where B stands for the beginning tag and I stands for the inside tag. Tokens other than named entities in the training data are tagged as O, which stands for the outside tag. Based on these pre-defined categories of entity tags, part of the unlabeled dataset is manually tagged in this work. An example tweet with BIO format tagging of named entities is shown in Fig. 5. The whole set of tagged named entities is available in the annotation data.

Example for Named Entity in BIO format.
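The BIO conversion described above can be sketched in a few lines. The span representation (start index, exclusive end index, label) and the function name are assumptions made for illustration, and the example sentence is invented rather than taken from the dataset.

```python
def to_bio(tokens, entity_spans):
    """Convert entity spans (start, end_exclusive, label) into per-token BIO tags."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in entity_spans:
        tags[start] = "B-" + label  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining tokens of the same entity
    return tags


tokens = ["Narendra", "Modi", "visited", "Chennai"]
spans = [(0, 2, "PERSON"), (3, 4, "LOCATION")]
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(token, tag)
```

A single-token entity receives only a B- tag, which is why the same surface word can carry a B- tag in one tweet and an I- tag in another.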
Social media text from the Twitter platform is utilised in the proposed work. Tweets in Tamil script are extracted from Twitter through Python code. A part of the dataset is from the FIRE2015 task and the remaining unlabeled data is extracted from Twitter for the proposed system. Statistics of the dataset used in this paper are tabulated in Table 1. The training data includes 10,000 tweets with 128,605 tokens, of which 53,549 tokens are annotated data retrieved from the FIRE2015 dataset and the remaining 75,056 tokens are manually tagged.
Statistics of Dataset used in Proposed system
In the training dataset, there are about 23 named entities represented in BIO format. Of these 23 entities, the 4 major entities are selected based on frequency of occurrence, and their percentage of occurrence among all the named entity tags is tabulated in Table 2.
Individual count and Percentage of Major Named-entities
The unlabeled Twitter corpus used in the proposed work contains 74,574 tweets. The dataset used in the proposed system is Tamil text data from the Twitter platform. The raw data includes the Tweet ID, User ID and Tweet. In any NLP task, the initial step carried out on obtaining a raw dataset is pre-processing. The raw data is unstructured; moreover, it includes redundant tweets, IDs and special symbols, which are removed at the time of tokenization. Hyperlinks are eliminated using a regular expression. The presence of a hyperlink is nevertheless recorded using the string HTTP. The annotation data holds the entity tags for the named entities in the training data. From the annotation data, the entity tag for each word in the training data can be obtained in BIO format.
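The pre-processing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the script used in this work: the regular expression, the whitespace tokenizer and the rule that drops tokens without any letter or digit are all choices made for the sketch, and the input is assumed to be the tweet text only, with the Tweet ID and User ID fields already stripped.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(tweets):
    """Drop redundant tweets, mark hyperlinks as HTTP, and tokenize on whitespace."""
    seen = set()
    tokenized = []
    for text in tweets:
        text = URL_PATTERN.sub("HTTP", text).strip()
        if not text or text in seen:
            continue  # skip empty and duplicate tweets
        seen.add(text)
        # Keep only tokens that contain at least one letter or digit,
        # which removes stand-alone symbols and smileys.
        tokens = [t for t in text.split() if any(ch.isalnum() for ch in t)]
        tokenized.append(tokens)
    return tokenized


sample = [
    "hello :) https://t.co/abc",
    "hello :) https://t.co/abc",
    "news www.example.com today",
]
print(preprocess(sample))
```

Replacing a link with a placeholder rather than deleting it keeps the information that a link was present, which can itself be a useful context token.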
The input to the word embedding models is in the form of sentences, in this case tweets. The reason is that vectors are generated for each word based on its contextual knowledge. In order to capture context information, a word has to be trained with respect to its context. The word2vec and glove models take sentences as input and generate vectors for the vocabulary words, based on the min-count set at the time of training the model. Thus, word vectors of size 100 are retrieved from the word2vec and glove embedding models. Tri-gram embedding of these vectors is performed to enrich the features of each word with context knowledge. For a word X0, the tri-gram embedding vector set is the combination of the vectors of the previous word X-1, the current word X0 and the next word X1. An SVMLight [6] based support vector machine classifier is used to implement the three systems based on word2vec, glove and word-glove vectors respectively. Cross-validation results giving the overall accuracy and the accuracy on known, unknown and ambiguous tokens for the three systems are tabulated in Table 3.
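The tri-gram embedding step can be illustrated with a short sketch. It assumes that "combination" means concatenation of the three vectors and that a zero vector stands in for the missing neighbour at the start and end of a tweet; neither detail is stated above. The joint (word-glove) feature is likewise assumed to be a per-token concatenation, and all function names are hypothetical.

```python
def trigram_features(sentence_vectors, dim):
    """Concatenate the vectors of the previous, current and next word for each token."""
    pad = [0.0] * dim  # assumption: zero vector for a missing neighbour
    padded = [pad] + list(sentence_vectors) + [pad]
    return [
        padded[i - 1] + padded[i] + padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]


def joint_vectors(w2v_vectors, glove_vectors):
    """Assumed joint feature: word2vec vector followed by glove vector, per token."""
    return [a + b for a, b in zip(w2v_vectors, glove_vectors)]


# Toy usage with two-dimensional vectors for a three-token tweet.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in trigram_features(vectors, dim=2):
    print(row)
```

Under these assumptions, 100-dimensional word vectors would give each token a 300-dimensional tri-gram feature vector.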
Cross Validation results for the systems based on Word2vec, Glove and Word-Glove features
The experimental results from the SVM are further refined and analysed for entity-wise accuracy. Among the 23 entities present in the training data, 4 major entities were selected based on frequency of occurrence. Entity-wise computation is based on evaluation metrics such as precision, recall and F1-measure for these major entity tags. Precision (P) is the proportion of tokens classified under a label (say x) that are correctly classified. Recall (R) is the proportion of tokens actually belonging to label x that are classified under label x. F1-measure (F1) is the harmonic mean of precision and recall and summarises both in a single score. The Precision (P), Recall (R) and F1-measure (F1) results are tabulated in Tables 4, 5 and 6 for the systems based on embedding vectors from word2vec, glove and word-glove respectively.
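The three metrics can be computed per entity tag from the gold and predicted tag sequences, as in the sketch below. It assumes evaluation at the token level, and the tag sequences are invented for illustration rather than taken from the experiments.

```python
def precision_recall_f1(gold, predicted, label):
    """Token-level precision, recall and F1 for one entity tag."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


gold = ["B-PERSON", "I-PERSON", "O", "B-PERSON"]
predicted = ["B-PERSON", "B-PERSON", "O", "O"]
print(precision_recall_f1(gold, predicted, "B-PERSON"))
```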
Performance Evaluation of system using word2vec embedding features
Performance Evaluation of the system using Glove embedding features
Performance Evaluation of system using word-glove embedding features
From the above mentioned results, it is evident that tri-gram embedding of glove embedding vectors performs better than word2vec based embedding vectors.
The system performance is analysed in terms of overall accuracy and entity-wise Precision, Recall and F1-measure. To get a detailed view of the misclassified entities, a confusion matrix is generated from the entity tags obtained as output during the testing phase of the system.
Figure 6 shows the confusion matrices for the systems using word2vec features (a), glove features (b) and Word-Glove embedding features (c) respectively. It can be inferred from the confusion matrices that some of the entities are misclassified because a few words are multifaceted in nature. To be specific, the confusion matrices show that some of the words that belong to the entity I-PERSON are wrongly classified as B-PERSON. For example, consider the names Narendra Modi and Modi. The first has two tokens, tagged as B-PERSON (Narendra) and I-PERSON (Modi). The second is tagged as B-PERSON (Modi). Even though both refer to the same person, the use of the same word in a different form changes the tagging. Hence, named entities of this type are erroneously classified into another category of entity. An example of words that are classified as an entirely different named entity is given in Fig. 7.
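A confusion matrix of this kind can be tallied directly from the gold and predicted tags. The sketch below stores it as a counter keyed by (gold tag, predicted tag) pairs and lists the off-diagonal cells; the function names and the tag sequences are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count how often each gold tag was predicted as each output tag."""
    return Counter(zip(gold, predicted))


def misclassified(counts):
    """List (gold, predicted, count) for the off-diagonal cells, largest first."""
    errors = [(g, p, n) for (g, p), n in counts.items() if g != p]
    return sorted(errors, key=lambda item: -item[2])


gold = ["B-PERSON", "I-PERSON", "O", "B-PERSON", "I-PERSON"]
predicted = ["B-PERSON", "B-PERSON", "O", "O", "B-PERSON"]
for row in misclassified(confusion_counts(gold, predicted)):
    print(row)
```

Reading the largest off-diagonal cells first points to the tag pairs that are most often confused with each other.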

Performance evaluation of the classification model using (a) word2vec features, (b) Glove features and (c) Word-Glove features.

Example for Named Entity in BIO format.
The first word, in its context in the dataset, refers to a cement company, so its annotated tag is ORGANIZATION, but the system has tagged it as MATERIAL. The second word denotes the entity YEAR but is tagged as COUNT. This is because both YEAR and COUNT entities involve numerals. The third word refers to the movie name ROJA in Tamil cinema, hence it should fall into the entity type ENTERTAINMENT. These are the inferences made from the error analysis of the predicted entities.
In this paper, the word features are based on the most widely used embedding models, word2vec and glove. Feature enrichment is achieved by performing N-gram embedding. The results indicate that tri-gram features outperform uni-gram features. Greater clarity in the tagging of entities during training can help disambiguate minor differences. The novelty of the proposed system lies in the joint embedding features, and the system is implemented using a machine learning classifier. It can be seen that glove embedding vectors capture the semantic meaning of words better than word2vec features. Future work will explore deep learning methods such as the Recurrent Neural Network (RNN).
