Abstract
The language used by the users in social media nowadays is Code-mixed text, i.e., mixing of two or more languages. This paper describes the application of the code mixed index in Indian social media texts and comparing the complexity to identify language at word level using Bi-directional Long Short Term Memory model. Social media platforms are now widely used by people to express their opinion and interest. The major contribution of the work is to propose a technique for identifying the language of Hindi-English code-mixed data used in three social media platforms namely, Facebook, Twitter, and WhatsApp. We recommend a deep learning framework based on cBoW and Skip gram model that predicts the origin of the word from language perspective in the sequence based on the specific words that have come before it in the sequence. The context capture module of the system gives better accuracy for word embedding model as compared to character embedding.
Keywords
Introduction
In this Hindi words are labeled as H and English word are labeled as E and Named entity as NE. We can observe from the example that the Hindi words, tagged as H, were written in Roman Script instead of Unicode characters. The paper presents a novel architecture, which captures information at both word level and context level to output the final tag for language identification in context to the word belonging to which language. For word level, we have used a multichannel neural network (MNN) inspired by the recent works of computer vision. Such networks have also shown promising results in NLP tasks like sentence classification [2]. For context capture, we used Bi-directional Long Short Term Memory (BLSTM). The context module was tested more rigorously as in quite a few of the previous work, this information has been sidelined or ignored. We have experimented on Hindi-English (H-E) code mixed data. Hindi is the most popular spoken language of India. Here Hindi words are written in Roman transliterated form using the English alphabet.
For processing monolingual text, the primary step would be Part-Of-Speech (POS), tagging of the text. But in the case of social media text, the primary concern is to identify the languages used in the text [3]. The language identification for code-mixed text proposed in this paper is implemented using word embedding models. The term word embedding refers to the vector representation of the given data capturing the semantic relation between the words in the data. The work is a generalized approach because this system can be extended for other NLP applications since only word embedding features are considered. The work involves features obtained from two embedding models, word-based embedding and character-based embedding. A comparison of the performance of the two models with the addition of contextual information is performed in this paper. The machine learning [4] based classification is used for training and testing of the system. Framework for discovering user intend based on Hindi roman transliteration by identifying the word level language identification was addressed here. The remaining section of the paper is organized as follows: An overview of the related works on language identification in the multilingual domain is discussed in Section 2. A discussion on the methodology proposed considering word embedding and character embedding method is discussed in Section 3. The dataset description is stated in Section 4. Section 5 describes the experimental evaluation and results obtained. Section 6, analyses the inferences obtained from the work done and a pointer towards the future work.
Related researches
In this section, some of the recent techniques regarding the language transliteration and identification is listed and reviewed as follows.
Code-switching and mixing is a current research area in the field of language tagging. Language Identification (LID), is a primary task in many text processing applications and hence several researches are going on in this area especially with the code-mixed data. King and Abney [5] used semi-supervised methods for building a world level language identifier. Nguyen and Dogruöz [6] used CRF model limited to bigrams for identifying the language. Logistic regression along with a module which gives code-switching probability was used by Yogarshi et al. [7]. Das and Gamback [8] used various features like a dictionary,
A shared task on Mixed Script Information Retrieval (MSIR) 2015 was conducted in which a subtask includes language identification of 8 code-mixed Indian Languages, Telugu, Tamil, Marathi, Bangla, Gujarati, Hindi, Kannada, and Malayalam, each mixed with English [9]. The MSIR language identification task was implemented by using machine learning based SVM classifier and obtained an accuracy of 76% [16]. Word level language identification was performed for English-Hindi using supervised methods [10]. Naive Bayes classifier was used to identify the language of Hindi-English data and an accuracy of 77% was obtained [11].
Language Identification is also performed as a primary step to several other applications. [12], implemented a sentiment analysis system which utilized MSIR 2015 English-Tamil, English-Telugu, English-Hindi, and English-Bengali code-mixed dataset. Another emotion detection system was developed for Hindi-English data with machine learning based and Teaching Learning Based Optimization (TLBO), techniques [13]. Part-of-Speech tagging was done for English-Bengali-Hindi corpus including the language identification step [14].
Framework for word origin detection.
Since the code-mixed script is the common trend in the social media text today, many kinds of research are going on for the information extraction from such text. An analysis of the behavior of code-mixing [15] in Hindi-English Facebook dataset was done. POS Tagging technique was performed on code-mixed social media text in Indian languages [16]. A shared task was organized for entity extraction on code-mixed Hindi-English and Tamil-English social media text [17]. Entity extraction for code-mixed Hindi-English and Tamil-English dataset was performed with embedding models [18]. Sapkal and Shrawankar [19] have given the approach by the use of SMS which is meant for communicating with others in minimal words. The regional language messages are printed using English alphabets due to the lack of regional keywords. This SMS language may fluctuate, which leads to miscommunication. The focus was on transliterating short form to full form. Zubiaga et al. [20] had mentioned language identification, as the mission of defining the language of a given text. On the other hand, certain issues like quantifying the individuality of similar languages in multilingualism document and analyzing the language of short texts are still unresolved. The below section describes the proposed methodology to overcome the research gap identified in the area of transliterated code mixed data. Alekseev and Nikolenko [29] considered word embedding as an efficient feature and proposed entity extraction for user profiling using word-embedding features.
The proposed work is based on the findings of related research in the field of code mixing. The complexity and need of identifying language in code mixed data is modeled and presented in the work. The code mixed data include the combination of the native script (familiar language) and the non-native script (unfamiliar language). Due to this combination, a massive number of complications arise while dealing with this mixed code. Language Identification is the main and the foremost problem identified in the mixed code data since every user may not be clear about every language recognition in the globe. The problem of language identification arises when the text is written in different languages. This also incorporates problems such as the script specifications leading to the possibility of different scripts between the source and target languages.
Methodology of the MNN for language prediction.
The proposed system is comprising of two modules. The first one is a multichannel neural network trained at the word level, while the second one is a simple bidirectional LSTM trained at the context level. The Fig. 1 describes the first module of the system where code mixed input data is processed and tokenized for embedding. There are many probable spelling variations exists in Hindi roman words. Character embedding is done for Hindi roman words and word embedding is done for English words found in input text. The tokens are matched with the trained bilingual lexicons and are given to MNN where it is used on the basis of words, parts of speech and
The Fig. 2 describes the working of MNN as second module for predicting the words belonging to language labels L
In this proposal, two techniques were considered based on word-based embedding features and character-based context features. This is done to get comparative analysis for the embedding model. The character based approach has the same procedure as that of word-based except that the vectors are character vectors in case of character based context embedding is concerned.
For understanding the embedding models consider this example ‘girl-woman’ vs. ‘girl-apple’. For us, it is quite obvious to understand the associations between words in a language. We know that ‘girl’ and ‘woman’ have more similar meanings than ‘girl’ and ‘apple’ but if we want computers to understand these associations word embeddings come into play. Word embeddings transform human language meaningfully into a numerical form. The main idea here is that every word can be converted to a set of numbers called N-dimensional vector. Every word gets assigned to a unique vector. Similar words end up having values closer to each other while non similar words will have far distances. The vectors for the words ‘woman’ and ‘girl’ would have a higher similarity than the vectors for ‘girl’ and ‘apple’ when represented in vector space, their vectors would be at a shorter distance from each other. The idea behind representing this is, that for any given two words, if these two words have a similar meaning, they are likely to have similar context words. For these numerical representations to be really useful, the goal is to capture meanings, semantic relationships, similarities between words, and the context of different words as they are used naturally by humans. The meaning of a word can be captured, to some extent, by its use with other words. For example, ‘food’ and ‘hungry’ are more likely to be used in the same context than the words ‘hungry’ and ‘software’. And this is used as the basis of the training algorithms for word embeddings.
For the embedding to capture the word representation code-mixed Hindi-English social media data is used. The embedding model generates the vector of each vocabulary (unique), word present in the data. Along with extracting the feature vectors of the train data, its context information is also extracted. The incorporation of the immediate left and right context features with the features of the current word is called 3-gram context appending. Five-gram features were also extracted, which is the extraction of features from two neighboring words before and after the current word. So if the vocabulary size of the training data is
The word-based embedding model is used to find the feature vectors that are useful in predicting the neighboring tokens in a context. Word embedding is a technique where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network. The key to the approach is the idea of using a dense distributed representation for each word where the word in the vocabulary represents the word feature vector. The feature vector represents different aspects of the word and each word is associated with a point in a vector space. The number of feature is much smaller than the size of the vocabulary. Considering the following two documents the vocabulary size is 13 and 15. Consider this example to understand the process of generating feature vectors.
Document 1: “agar aap is page ke follower hain to is page ko like karein”
Document 2: “agar aap is page ke follower nahi hain to is page ko like nahi karein”
The feature size is 11 and 12 respectively for documents 1 and 2. The feature size is calculated by counting the number of times each word occurs in each document, so the feature vector for documents is:
The
Skip-gram model.
Word2vec is a predictive model that is used to produce word embedding’s from raw text [31]. It exists in two forms, the continuous Bag-of-Words model (cBoW) and the Skip-Gram model. Algorithmically, these two are similar, except that cBoW forecasts target words from source context words, whereas the skip-gram forecasts source context words from the target words. This gives the flexibility to use skip- gram when we are having a large dataset and one can use cBoW for the smaller dataset. We focused on the skip-gram model for language identification at word level in the multilingual domain to answer (word belongs to which language) in the rest of this paper. The illustration of Skip-gram model is shown in Fig. 3. Here the input token is T
Where
The procedure for character embedding is the same as that of skip-gram based word embedding. Each token in the trained data gets splitted into characters and then fed to the system. This will generate a vector for each character. The vector size to be generated was fixed as 100. The vectors generated for each character is used to create vectors for each token as per Eq. (3).
In regard to above equation softmax parameters are denoted by
The word PAANI is split into characters and given to the system to produce an embedding feature vector. The vectors are generated for each character in the word. These are then transformed to produce the character-based embedding vector of the word PAANI using Eq. (3). The vectors for each token are then used to extract the context feature vectors. To understand effectively consider an example the word
Embedding model.
Each document must consist of words from two languages. All the documents must be in a single script. The chosen script, in this case, is ROMAN Script. In the Indian scenario, code-mixing is applicable between English and other Indian languages. The language used in the proposal is English and Hindi, where Hindi is represented using Roman, not Devanagari.
If the Hindi words are written in Devanagari script, it is then a simpler task to identify the language. This becomes non-trivial tasks to identify the language as both Hindi and English are written using the same character set.
The algorithm takes the input as code mixed text where each word is represented as a bag of character
Dataset ICON 2016 [25]
Sample data
The dataset used for this work is obtained from POS Tagging task for Hindi-English code-mixed social media text conducted by ICON 2016 [25]. The dataset contains the text of three social media platforms namely Facebook, Twitter and WhatsApp. The train data provided contains the tokens of the dataset with its corresponding language tag and POS tag.
The dataset used here for language identification is Indian language corpora used in the FIRE2014 (Forum for IR Evaluation) shared task on transliterated search. Data used for training the classifier consists of bilingual documents containing English and Hindi words in Romanized script for Bollywood Song Lyrics. Complete database of songs consists of 63,000 documents in form of text file. (Dataset of FIRE MSIR). The below table shows the sample dataset showing various transliterated variations for non-English word and a second sample for mixed script data having words as English and transliterated Hindi words.
Experimental results
The next section discusses the complete experimental part along with results and consequent discussions.
Experimental results
The proposed algorithm for retrieving language of the word in code mixed data is evaluated on the basis of statistical measures and also evaluated using the machine learning approach. The below section provides the complete evaluation based on the statistical model. We performed two separate experiments on the code mixed data to rationalize the performance of the language, we have computed code-mixing patterns in the dataset on two metrics. This is being used to know the mixing patterns in the dataset. The proposed system is analyzed and evaluated on the basis of following code mixing metrics.
where
where
Where
MI and CMI values
To understand the model, consider the following scenario, sentence S1 contains ten words. Five words are from Language L1 and remaining 5 words are from Language L2. Applying Eq. (6) the CMI will be 100
(a): word level similarity; (b): word level similarity.
Secondly, we computed the similarity score based on the proposed algorithm on the dataset using the Eq. (7). It gives significance in labeling the word as either English or Hindi based on the frequency of the word. The proposed algorithm checks the Conf_Score of the classifier for Language Lj on input Wi as 0
Description of the labels for Hindi-English dataset
Sentence level similarity.
Visualization of word level language identification by the statistical model.
The next section describes the experimental evaluation based on applying BLSTM neural model. The dataset used for this work is obtained from POS Tagging task for Hindi-English code-mixed social media text conducted by ICON 2016 [25]. The dataset contains the text of three social media platforms namely Facebook, Twitter and Whatsapp. We use the Hindi-English dataset for the experimental evaluation. The labels used are summarized in Table 4.
The training data contains the tokens of the dataset with its corresponding language tag and POS tag.
Embedding dataset
F measure obtained for Twitter
F measure obtained for Facebook
F measure obtained for WhatsApp
F-score for label E, H, and NE.
All the seven tags are present in the Facebook dataset, where ‘E’, ‘H’, ‘NE’, ‘Other’ are the tags present in Twitter and Whatsapp data. The size of the training and testing data is summarized in Table 4. From the table, it can be observed that the average tokens per comment of Whatsapp training and testing data are very less than Facebook and Twitter data. This may be due to the fact that Facebook and Twitter data mostly contains news articles and comments which make the average tokens per comment count to be more while Whatsapp contains conversational short messages.
For generating the embedding vectors, more dataset has to be provided to efficiently obtain the distributional similarity of the data. The additional dataset collected along with the training data will be given to the embedding model. The Hindi-English additional code-mixed data were collected from Shared task on Mixed Script Information Retrieval (MSIR), conducted in the year 2016 [26] and 2015 [27] and shared task on Code-Mix Entity Extraction task conducted by Forum for Information Retrieval and Evaluation (FIRE), 2016 [28]. Most of the data collected for embedding is Hindi-English code-mixed Twitter data. The size of the dataset used for embedding is given in below table.
Visualization of word representation learned by theBi-LSTM model for Hindi-English.
Visualization of character representation learned by the Bi-LSTM model for Hindi-English.
Context appending was done for each Facebook, Twitter and WhatsApp train as well as test data. These were given to the learning model for training and testing. The cross-validation accuracies obtained for Facebook, Twitter, and WhatsApp with 1-gram, 3-gram and 5-gram features for character-based embedding model and word-based embedding model is presented in below section. When comparing the overall accuracy obtained for Facebook, Twitter, and WhatsApp, we can see that the accuracy obtained is more with the word-based model as compared to character-based embedding model. It can also be observed that in the word-based embedding model, 3-gram-based features give more accuracy than 1-gram and 5-gram context feature model while in character-based model 5-gram gives more accuracy than 1-gram and 3-gram. When observing Tables 6–8 here shows the performance of Facebook, Twitter and WhatsApp Hindi-English code-mixed data, we can see that the F-score for language labels E – English, H – Hindi, NE – Named Entity is better using word embedding.
From the performance of data tabulated in Tables 6–8, it is clearly seen that the word embedding 3-gram based model gives a better score than other models. Table 6, holds label wise accuracy for Twitter data, Table 7 holds label wise accuracy for Facebook data and Table 8 holds label wise accuracy for WhatsApp data. It can be observed from the table that 3-gram word embedding model gives significant accuracy in comparison to 1 gram and 5 gram word embedding and also to character embedding model whereas in case of character gram model accuracy is better in 5 gram model except for WhatsApp accuracy where 5 gram shows better accuracy. This is because the system needs more context information to identify the language. That is why the 5-gram embedding gives a better result in the case of WhatsApp for character embedding techniques. Fig. 8 describes the analysis of F-score obtained for Facebook, Twitter and WhatsApp to represent different labels of text as E – English, H – Hindi and NE for Named entity.
We tend to envision the representations learned by the RNN model by the word embeddings for the selected subset of words from datasets. The above result maps the labels to colors’ indicating the defined seven parameters defined in Table 4. The color encoding is summarized as follows: 1) Red for label E, 2) Blue for Label H, 3) Black for Label NE, 4) Orange for Label Others, 5) Purple for Label Ambiguous and Mixed, and 7) Yellow for Label Unk (Unrecognized word).
The Figs 9 and 10 give a visual representation of the model trained in context to character level embedding and word level embedding in code mixed environment. The results are promising in terms of word level embedding as compared to character level embedding. The proposed neural model gives a clearer separation between the different labeling parameters as defined in Table 4 along with giving a crystal clear separation between the language Hindi and English used in the code mixed dataset. This result shows that this model can be scaled to detect language in code mixed data without any additional feature engineering for detecting other languages present in code mixed and in code switched environment.
The intricacy of language identification in code mixed and code switched data is governed by the following parameters: data source, code switching, code mixing, and the relation between the languages involved. We find that the code mixing is more used in social media context as per the evaluation and experiments. In this work. Code mixing metrics helps in identifying the code-mixing patterns across language pairs. By analyzing the code mixing metrics we conclude that Hindi-English words are often mixed in our dataset. It would be a great idea to investigate the emerging trend of code switching and code mixing to bring conclusion about the behavioral patterns in the data of different sources like lyrics of songs, chat data having different language sets, blog data and scripts of plays or movies. We have implemented two different evaluation models: statistical model and neural based learning model and obtained competitive results for the identification of languages. This is probably due to the amount of training and testing data we have. The results depict that the word embeddings are capable to detect the language separation by identifying the origin of the word and correspondingly mapping to its language label. The BLSTM system performs better for HIN-ENG language pairs. This model captures long-distance dependencies in a sequence and this is in line with the observation made above for identifying word level language identification in code mixed data considering the context of the word belonging to labeled languages. Scaling this system to identify other characteristics in code mixed data considering blend of different languages is a potential future direction to explore.
