An effective cybernated word embedding system for analysis and language identification in code-mixed social media text

Abstract

Social media users now routinely write code-mixed text, i.e., text that mixes two or more languages. This paper applies the code-mixing index to Indian social media text and compares the complexity of identifying language at the word level using a Bi-directional Long Short Term Memory model. Social media platforms are now widely used by people to express their opinions and interests. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data drawn from three social media platforms, namely Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the cBoW and skip-gram models that predicts the language of origin of each word in a sequence from the words that precede it. With the context capture module, the word embedding model gives better accuracy than the character embedding model.

Keywords

Language identification, transliteration, character embedding, word embedding, Natural Language Processing, cBoW, skip-gram

1. Introduction

Humans use natural language as their medium of communication. Natural Language Processing (NLP) is an area of Artificial Intelligence in which machines are trained to understand and process text so that human-computer interaction becomes more efficient. Applications of NLP lie in several fields, such as machine translation, text processing, and entity extraction [1]. With the emergence of social media platforms and the large amount of text data available on them, NLP plays a major role in understanding and generating data today. Social media platforms are widely used to discuss interests, hobbies, and reviews of products, movies, and so on. In earlier days, the language used on such platforms was purely English. Today, mixing multiple languages is a popular trend, and text written this way is called code-mixed. A large amount of textual data on the web is in code-mixed format. An example of Hindi-English code-mixed text is given in the following two sentences:

Sentence 1:	GLA	University	ka	course
	NE/OOV	E	H	E
	Structure	kaisa	hai
	E	H	H

Sentence 2:	Aray	Friend	ek	super
	H	E	H	E
	idea	hai	mere	paas
	E	H	H	H

Here Hindi words are labeled H, English words E, and named entities NE. The example shows that the Hindi words, tagged H, are written in Roman script rather than in Devanagari (Unicode) characters. The paper presents a novel architecture that captures information at both the word level and the context level to output the final language tag for each word. For the word level, we use a multichannel neural network (MNN) inspired by recent work in computer vision. Such networks have also shown promising results in NLP tasks like sentence classification [2]. For context capture, we use a Bi-directional Long Short Term Memory (BLSTM) network. The context module was tested more rigorously because quite a few previous works sidelined or ignored this information. We experimented on Hindi-English (H-E) code-mixed data. Hindi is the most widely spoken language of India. Here Hindi words are written in Roman transliterated form using the English alphabet.

For processing monolingual text, the primary step is Part-Of-Speech (POS) tagging of the text. In the case of social media text, however, the primary concern is to identify the languages used in the text [3]. The language identification for code-mixed text proposed in this paper is implemented using word embedding models. The term word embedding refers to a vector representation of the given data that captures the semantic relations between the words in the data. The approach is general: because only word embedding features are considered, the system can be extended to other NLP applications. The work uses features obtained from two embedding models, word-based embedding and character-based embedding. The paper compares the performance of the two models with the addition of contextual information. A machine learning [4] based classifier is used for training and testing the system. The paper also addresses a framework for discovering user intent from Hindi Roman transliteration through word-level language identification. The remainder of the paper is organized as follows. Section 2 gives an overview of related work on language identification in the multilingual domain. Section 3 discusses the proposed methodology, covering the word embedding and character embedding methods. Section 4 describes the dataset. Section 5 describes the experimental evaluation and the results obtained. Section 6 analyses the inferences drawn from the work and points towards future work.

2. Related research

In this section, some recent techniques for language transliteration and identification are listed and reviewed.

Code-switching and code-mixing form an active research area in the field of language tagging. Language Identification (LID) is a primary task in many text processing applications, and hence much research is under way in this area, especially with code-mixed data. King and Abney [5] used semi-supervised methods to build a word-level language identifier. Nguyen and Dogruöz [6] used a CRF model limited to bigrams for identifying the language. Yogarshi et al. [7] used logistic regression along with a module that gives the code-switching probability. Das and Gamback [8] used various features, such as a dictionary, $n$-grams, edit distance, and word context, for identifying the origin of the word.

A shared task on Mixed Script Information Retrieval (MSIR) was conducted in 2015, in which one subtask covered language identification for 8 code-mixed Indian languages (Telugu, Tamil, Marathi, Bangla, Gujarati, Hindi, Kannada, and Malayalam), each mixed with English [9]. The MSIR language identification task was implemented using a machine learning based SVM classifier, which obtained an accuracy of 76% [16]. Word-level language identification was performed for English-Hindi using supervised methods [10]. A Naive Bayes classifier was used to identify the language of Hindi-English data, and an accuracy of 77% was obtained [11].

Language identification is also performed as a preliminary step in several other applications. The work in [12] implemented a sentiment analysis system that utilized the MSIR 2015 English-Tamil, English-Telugu, English-Hindi, and English-Bengali code-mixed datasets. Another emotion detection system was developed for Hindi-English data with machine learning based and Teaching Learning Based Optimization (TLBO) techniques [13]. Part-of-Speech tagging was done for an English-Bengali-Hindi corpus, including the language identification step [14].

Figure 1.

Framework for word origin detection.

Since code-mixed script is a common trend in social media text today, much research is under way on information extraction from such text. The behavior of code-mixing in a Hindi-English Facebook dataset was analyzed in [15]. POS tagging was performed on code-mixed social media text in Indian languages [16]. A shared task was organized for entity extraction on code-mixed Hindi-English and Tamil-English social media text [17]. Entity extraction for code-mixed Hindi-English and Tamil-English datasets was performed with embedding models [18]. Sapkal and Shrawankar [19] presented an approach for SMS, which is meant for communicating with others in minimal words. Regional language messages are typed using the English alphabet due to the lack of regional keyboards. This SMS language may vary, which leads to miscommunication. Their focus was on transliterating short forms to full forms. Zubiaga et al. [20] described language identification as the task of determining the language of a given text. However, certain issues remain unresolved, such as distinguishing similar languages in multilingual documents and identifying the language of short texts. Alekseev and Nikolenko [29] considered word embedding an efficient feature and proposed entity extraction for user profiling using word-embedding features. The following section describes the proposed methodology to address the research gap identified in the area of transliterated code-mixed data.

3. Proposed work

The proposed work is based on the findings of related research in the field of code mixing. The work models and presents the complexity of, and the need for, identifying language in code-mixed data. Code-mixed data combines the native script (familiar language) and the non-native script (unfamiliar language). This combination gives rise to a large number of complications when dealing with code-mixed text. Language identification is the foremost problem in code-mixed data, since no user can be expected to recognize every language in the world. The problem of language identification arises when the text is written in different languages. It also involves issues such as script specifications, since the source and target languages may use different scripts.

Figure 2.

Methodology of the MNN for language prediction.

The proposed system comprises two modules. The first is a multichannel neural network trained at the word level, while the second is a simple bidirectional LSTM trained at the context level. Fig. 1 describes the first module of the system, in which the code-mixed input data is processed and tokenized for embedding. Hindi Roman words admit many spelling variations. Character embedding is applied to Hindi Roman words, and word embedding is applied to the English words found in the input text. The tokens are matched against the trained bilingual lexicons and passed to the MNN, which uses the words, parts of speech, and $n$-grams available in the input text to predict the language of the words together with the bidirectional LSTM model. The second module takes the output of the first module and applies the context appending features to produce the language tag for the given word: E – English, H – Hindi, or O – Others. The neural learning model for predicting the language is illustrated in Fig. 1.

Fig. 2 describes the working of the MNN module for assigning words to the language labels L ${}_{\text{E}}$ , L ${}_{\text{H}}$ and L ${}_{\text{O}}$ , where L ${}_{\text{E}}$ is English, L ${}_{\text{H}}$ is Hindi and L ${}_{\text{O}}$ is others. This module is inspired by [30] and the recent deep neural architectures developed for image classification tasks [21]. The proposed model uses a very similar concept for learning the language at the word level, because the architecture allows the network to capture representations of different types, which helps in identifying the language of origin of a word in code-mixed data. The network we developed has 4 channels: the first three feed into a one-dimensional convolution (Conv1D) network [22], while the fourth feeds into a Long Short Term Memory (LSTM) network [23]. A softmax activation function is used at the output for multiclass classification. It classifies the words into three classes: English, Hindi, and Others.

In this proposal, two techniques are considered, based on word-based embedding features and character-based context features, to obtain a comparative analysis of the embedding models. The character-based approach follows the same procedure as the word-based approach, except that the vectors are character vectors.

To understand embedding models, consider the example ‘girl-woman’ vs. ‘girl-apple’. For humans, the associations between words in a language are quite obvious. We know that ‘girl’ and ‘woman’ have more similar meanings than ‘girl’ and ‘apple’, but for computers to understand these associations, word embeddings come into play. Word embeddings transform human language meaningfully into a numerical form. The main idea is that every word can be converted to a set of numbers called an N-dimensional vector. Every word is assigned a unique vector. Similar words end up with vectors close to each other, while dissimilar words are far apart. The vectors for ‘woman’ and ‘girl’ would therefore have a higher similarity than the vectors for ‘girl’ and ‘apple’; in vector space, they would lie at a shorter distance from each other. The underlying idea is that if two words have a similar meaning, they are likely to have similar context words. For these numerical representations to be really useful, the goal is to capture meanings, semantic relationships, similarities between words, and the contexts of different words as they are used naturally by humans. The meaning of a word can be captured, to some extent, by its use with other words. For example, ‘food’ and ‘hungry’ are more likely to be used in the same context than ‘hungry’ and ‘software’. This is the basis of the training algorithms for word embeddings.

To capture word representations, the embedding is trained on code-mixed Hindi-English social media data. The embedding model generates a vector for each unique vocabulary word present in the data. Along with the feature vectors of the training data, context information is also extracted. Incorporating the immediate left and right context features with the features of the current word is called 3-gram context appending. Five-gram features were also extracted, that is, features from the two neighboring words before and after the current word. So if the vocabulary size of the training data is $|\text{V}|$ and the embedding feature size is 100 for each word, then context appending with 3-gram features yields a matrix of size $|\text{V}|$ $\times$ 300, and 5-gram appending yields a matrix of size $|\text{V}|$ $\times$ 500. The test data was also given to the embedding models and then appended with the 3-gram and 5-gram context information. These features were then fed to a machine learning based classifier to train and test the system. The following subsections describe the word and character embedding models.
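The context appending step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `append_context`, the left-current-right concatenation order, and the zero padding at sentence boundaries are all assumptions made for the example.

```python
def append_context(vectors, window=1):
    """Concatenate each token vector with `window` neighbours on each side.

    vectors: list of equal-length lists, one per token of a sentence.
    Neighbours that fall outside the sentence are zero-padded (an assumption).
    """
    dim = len(vectors[0])
    pad = [0.0] * dim
    out = []
    for i in range(len(vectors)):
        row = []
        for j in range(i - window, i + window + 1):
            row.extend(vectors[j] if 0 <= j < len(vectors) else pad)
        out.append(row)
    return out

# Toy 100-dimensional vectors for a four-token sentence.
sentence = [[float(t)] * 100 for t in range(1, 5)]
three_gram = append_context(sentence, window=1)  # 3 x 100 features per token
five_gram = append_context(sentence, window=2)   # 5 x 100 features per token
print(len(three_gram[0]), len(five_gram[0]))     # 300 500
```

With 100-dimensional embeddings, each row grows to 300 features for 3-gram appending and 500 for 5-gram appending, matching the matrix widths stated above.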

3.1 Word-based embedding model

The word-based embedding model is used to find the feature vectors that are useful in predicting the neighboring tokens in a context. Word embedding is a technique in which individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network. The key to the approach is the idea of using a dense distributed representation for each word, where each word in the vocabulary is represented by a word feature vector. The feature vector represents different aspects of the word, and each word is associated with a point in a vector space. The number of features is much smaller than the size of the vocabulary. To understand the process of generating feature vectors, consider the following two documents, which contain 13 and 15 tokens respectively.

Document 1: “agar aap is page ke follower hain to is page ko like karein”

Document 2: “agar aap is page ke follower nahi hain to is page ko like nahi karein”

The feature size, that is, the number of unique words, is 11 and 12 for Documents 1 and 2 respectively. Each feature value is the number of times the word occurs in the document, taken over the following feature list:

{agar, aap, is, page, ke, follower, hain, to, ko, like, karein, nahi}

In this order, {1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1} is the feature vector for D1, whose 11 words exclude nahi, and {1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2} is the feature vector for D2. Across the two documents, the word vector for the word follower is [1, 1] and for the word nahi is [0, 2]. The feature vector for this model is generated using the Skip-gram architecture of the popular Word2Vec package proposed by Mikolov et al. [24]. Apart from the skip-gram model, another architecture, continuous Bag of Words (cBoW), is also available [24].
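The count-based feature vectors for the two example documents can be reproduced with a few lines of Python. This is a sketch for illustration only; the helper name `count_vector` is hypothetical, and the feature order is the one listed in the text.

```python
from collections import Counter

doc1 = "agar aap is page ke follower hain to is page ko like karein"
doc2 = "agar aap is page ke follower nahi hain to is page ko like nahi karein"

# Shared feature list, in the order given in the text.
features = ["agar", "aap", "is", "page", "ke", "follower",
            "hain", "to", "ko", "like", "karein", "nahi"]

def count_vector(document, features):
    """Number of occurrences of each feature word in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in features]

v1 = count_vector(doc1, features)
v2 = count_vector(doc2, features)
print(len(doc1.split()), len(doc2.split()))  # 13 15
print(v1)  # [1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0]
print(v2)  # [1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2]
```

Over the shared 12-word list, Document 1 receives a trailing 0 for nahi, and reading one column across both documents gives the per-word vector, for example [1, 1] for follower.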

Figure 3.

Skip-gram model.

Word2vec is a predictive model that is used to produce word embeddings from raw text [31]. It exists in two forms, the continuous Bag-of-Words model (cBoW) and the Skip-gram model. Algorithmically, the two are similar, except that cBoW predicts target words from source context words, whereas skip-gram predicts source context words from the target words. This gives the flexibility to use skip-gram for a large dataset and cBoW for a smaller one. In the rest of this paper we focus on the skip-gram model for word-level language identification in the multilingual domain, that is, for deciding which language a word belongs to. The Skip-gram model is illustrated in Fig. 3. Here the input token T ${}_{0}$ is fed to a log-linear classifier to predict the neighboring words. T ${}_{-2}$ , T ${}_{-1}$ , T ${}_{1}$ and T ${}_{2}$ are the words before and after the current word. When data is given to the Skip-gram model, it maximizes the average log probability $L$ formulated in Eq. (1). In the equation, $N$ is the total number of words in the training data and $x$ is the context size. $P$ is the softmax probability given by Eq. (2). Maximizing this log probability maximizes the likelihood of the observed token sequences (e.g. a sentence or a sequence of words), which supports identifying each token in terms of language labels. Log probabilities, which are simply logarithms of probabilities, are widely used in NLP when computing with probabilities. Skip-gram defines the training objective for a single example as the log probability. By maximizing this objective, skip-gram estimates input and output vectors that reflect semantic relations between words that occur in similar contexts. The input vectors are then used as word embeddings. The main idea behind this maximization is to jointly learn the probability of occurrence of the next word and to predict the contexts of words.

$\displaystyle L=\frac{1}{N}\sum_{n=1}^{N}\sum_{-x\leqslant i\leqslant x,\,i\neq 0}\log P(T_{n+i}|T_{n})$ (1)

$\displaystyle P(T_{j}|T_{k})=\frac{\exp(V^{\prime\top}_{T_{j}}V_{T_{k}})}{\sum_{m=1}^{w}\exp(V^{\prime\top}_{m}V_{T_{k}})}$ (2)

Here $w$ is the vocabulary size, $P(T_{j}|T_{k})$ is the probability of occurrence of the next word, and $V^{\prime}$ is the output vector representation. The dataset, along with the additional data collected, was given to the skip-gram model. The vector size was fixed at 100. The skip-gram model generates a vector of size 1 $\times$ 100 for each vocabulary word in the dataset. From these, the vectors for the training data were extracted. The context appending features were then extracted from this file. The final training file for the classifier consists of the tokens in the training data, their language tags, and the extracted 3-gram and 5-gram context feature vectors. Thus three training files are generated, with dimensions $|\text{V}|$ $\times$ 101, $|\text{V}|$ $\times$ 301 and $|\text{V}|$ $\times$ 501. The test data with its corresponding context-appended vectors is fed to the classifier for testing the system.
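To make the double sum in Eq. (1) concrete, the sketch below enumerates the (target, context) pairs over which the log probability is accumulated for a context size $x$. It is illustrative only; the function name `skipgram_pairs` is hypothetical, and positions that fall outside the sentence are simply skipped, which is an assumption.

```python
def skipgram_pairs(tokens, x=2):
    """Return (target, context) pairs for a symmetric window of size x."""
    pairs = []
    for n, target in enumerate(tokens):
        for i in range(-x, x + 1):
            # Skip the target itself and positions outside the sentence.
            if i == 0 or not 0 <= n + i < len(tokens):
                continue
            pairs.append((target, tokens[n + i]))
    return pairs

tokens = "party abhi baaki hai".split()
for pair in skipgram_pairs(tokens, x=1):
    print(pair)
```

Each printed pair contributes one $\log P(T_{n+i}|T_{n})$ term to the objective; a larger $x$ produces more pairs per target word.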

3.2 Character-based embedding model

The procedure for character embedding is the same as that of skip-gram based word embedding. Each token in the training data is split into characters and then fed to the system. This generates a vector for each character. The vector size was fixed at 100. The vectors generated for the characters are used to create a vector for each token as per Eq. (3).

$\displaystyle Y=x+S_{h}(W,C_{t-k},\ldots C_{t+k},C)$ (3)

In the above equation, the softmax parameters are denoted by $x$ and $S$, where $h$ is the embedding features of the character and word. $C$ is the character vector and $W$ is the word vector. $C_{t-k},\ldots,C_{t+k}$ are the characters in the training data. The softmax function calculates the probability distribution of an event over $n$ different events, that is, the probability of each target class over all possible target classes. These probabilities help determine the target class for the given inputs. Each softmax output probability ranges from 0 to 1, and the probabilities sum to one. The proposed language identification task is a multiclass classification task in which each word is assigned to class L ${}_{\text{E}}$ , L ${}_{\text{H}}$ or L ${}_{\text{O}}$ . The softmax function returns the probability of each class, and the target class has the highest probability. Thus softmax determines the likelihood of a word belonging to English, Hindi or Others.
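The softmax step described above can be sketched in a few lines. The scores below are hypothetical values for a single word, chosen only to illustrate how the three class probabilities are obtained and how the most probable label is selected.

```python
import math

LABELS = ["E", "H", "O"]  # English, Hindi, Others

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5, -1.0]  # hypothetical classifier scores for one word
probs = softmax(scores)
predicted = LABELS[probs.index(max(probs))]
print(probs, predicted)
```

Every probability lies strictly between 0 and 1, and the predicted label is the class with the highest probability.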

For example, the word PAANI is split into characters and given to the system to produce an embedding feature vector. A vector is generated for each character in the word. These are then transformed to produce the character-based embedding vector of the word PAANI using Eq. (3). The vectors for each token are then used to extract the context feature vectors. The word PAANI can be written in many forms, such as “PAANII”, “PAANEE”, “PAANIE”, “PAANEI” and so on, because a Hindi Roman word admits many transliteration variants. The vectors for these variants will all differ, which makes it difficult for the system to manage the context of the word when identifying the language. Therefore, the feature vector with context features is appended along with the language tag and fed to the classifier for training the system. A similar procedure is followed for the test file: the vectors generated by the character embedding model are transformed into a context matrix for the test data, and this context matrix with the test words is fed to the classifier for testing the system.

Figure 4.

Embedding model.

3.3 Design consideration and proposed algorithm

• Each document must consist of words from two languages.
• All the documents must be in a single script. The chosen script, in this case, is Roman script.
• In the Indian scenario, code-mixing occurs between English and other Indian languages.
• The languages used in the proposal are English and Hindi, where Hindi is written in Roman script, not Devanagari.

If the Hindi words were written in Devanagari script, identifying the language would be a simpler task. It becomes non-trivial because both Hindi and English are written using the same character set.

Proposed algorithm for language identification

Input: Code Mixed Data
Output: Language of the input word

Algorithm steps
1.	Input term from Test Document
	Let $D=W_{1},W_{2}\ldots W_{n}$ be a document
	Where $W_{i}$ ’s are the words
2.	Letter of the words {a-z or A-Z}
3.	// word2vec level tagging
	3.1 $L_{b}(W_{i})$ chosen from Language L
	Where L $=$ {L ${}_{\text{E}}$ , L ${}_{\text{H}}$ , L ${}_{\text{O}}$ }
	// L ${}_{\text{E}}$ – English
	// L ${}_{\text{H}}$ – Hindi Roman
	// L ${}_{\text{O}}$ – Other
	// Check the frequencies of characters in $W_{i}$
	3.2 Generate vectors for the characters of $W_{i}$
	3.3 Apply the similarity metric
	$\text{Sim}(X,Y)=\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$
4.	Label the word E-English or H-Hindi.
5.	Check the Conf_Score of the classifier for Language L ${}_{j}$ on input $W_{i}$ as $0\leqslant\textit{Conf\_Score}\leqslant$ 1
	Where Conf_Score is the similarity metric
	Sim( $W_{x},W_{y}$ ), where $x$ and $y$ can be words in a string
	Sim( $x, y$ ) $\in$ [0, 1] for normalization
	Sim( $x, y$ ) $=$ 1: exact match
	Sim( $x, y$ ) $=$ 0: completely different $x$ and $y$
	0 $<\text{Sim}(x,y)<$ 1: approximate similarity
	Threshold value $=$ 1 for exact match
	Conf_Score $=$ 1 matches L ${}_{\text{E}}$ ; Conf_Score $<$ 1 matches L ${}_{\text{H}}$ OR L ${}_{\text{O}}$ , based on the list condition for L ${}_{\text{O}}$
	If the L ${}_{\text{E}}$ match for $W_{i}$ meets the list condition for L ${}_{\text{O}}$ , then L $=$ L ${}_{\text{O}}$
6.	Classify the word as E, H or O
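The similarity metric of step 3.3 is the cosine similarity, and the threshold rule of step 5 can be sketched on top of it. This is a minimal illustration under stated assumptions: the function names and the tolerance `tol` are hypothetical, the zero-vector fallback is a choice made for the example, and the handcrafted list condition that separates Hindi from Other is left out because the text does not specify it fully. Scores fall in [0, 1] only when the vectors have non-negative components.

```python
import math

def cosine_similarity(x, y):
    """Sim(X, Y) from step 3.3; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def label_from_score(conf_score, tol=1e-9):
    """Exact match (score 1) gives 'E'; a lower score gives 'H or O',
    to be resolved afterwards by the list condition."""
    return "E" if abs(conf_score - 1.0) <= tol else "H or O"

same = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(label_from_score(same), label_from_score(different))
```

A small tolerance is used when testing for an exact match because floating-point rounding can leave the score of identical vectors marginally below 1.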

The algorithm takes code-mixed text as input, where each word is represented as a bag of character $n$-grams. For example, for the word matter with $n=$ 3, the fastText representation of the character $n$-grams is _ma, mat, att, tte, ter, er_. The symbol _ is added at both word boundaries to distinguish an $n$-gram of a word from a word itself; for example, if the word mat is part of the vocabulary, it is represented as _mat_. This helps preserve the meaning of shorter words that may show up as $n$-grams of other words. Word representations are trained to predict the words that appear in their context over a large training corpus represented as a sequence of words $W_{1}\ldots W_{T}$ . The algorithm tokenizes the code-mixed input, and Word2vec is used for vector generation. The vector values are used to calculate the similarity score of the input words against the pre-labeled words in the training set. Language labeling is done on the basis of the confidence score, which lies between 0 and 1. The value 1 denotes an exact match, and a value less than 1 denotes an approximate match, as in the case of Roman Hindi words and other words. If the confidence score is 1, the word is labeled as English; otherwise it is classified as Hindi or other. Words that may belong to other are checked against the handcrafted list condition of the training set to resolve ambiguity, since many English words are also used ambiguously in the Roman Hindi context. The list condition uses words like main, to, is, us, or, log, array, hat, mat, etc. The objective of the algorithm is to identify the language to which each word belongs.
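The character $n$-gram decomposition for the word matter can be reproduced with a short helper. The function name `char_ngrams` is hypothetical, and the underscore boundary symbol follows the notation used in the text.

```python
def char_ngrams(word, n=3, boundary="_"):
    """Character n-grams of a word padded with boundary symbols."""
    padded = f"{boundary}{word}{boundary}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("matter"))  # ['_ma', 'mat', 'att', 'tte', 'ter', 'er_']
```

The trigram mat inside matter stays distinct from the whole word mat, which would be stored with its boundaries as _mat_.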

Table 1
Dataset ICON 2016 [25]

Data	No. of sentences		No. of tokens		Average tokens per sentence	
	Training	Testing	Training	Testing	Training	Testing
Facebook	772	111	20615	2167	26.7	
Twitter	1096	110	17311	2163	15	
WhatsApp	763	219	3218	802		

Table 2
Sample data

Hindi Roman Data sample:

amir se hoti hai, garib se hotii hai

door se hotee hai, qarib se hoti hai

magar jahaan bhi hoti hai, ai mere dost

shaadiyaan to naseeb se hoti hai

Mixed Script Data sample:

Party abhi baaki hai ………

Party abhee baaki hai ………

Party abhie baakee hae ………

4. Dataset descriptions

The dataset used for this work is obtained from the POS tagging task for Hindi-English code-mixed social media text conducted at ICON 2016 [25]. The dataset contains text from three social media platforms, namely Facebook, Twitter and WhatsApp. The training data contains the tokens of the dataset with their corresponding language tags and POS tags.

The dataset used here for language identification is the Indian language corpus used in the FIRE 2014 (Forum for IR Evaluation) shared task on transliterated search. The data used for training the classifier consists of bilingual documents containing English and Hindi words in Romanized script from Bollywood song lyrics. The complete database of songs consists of 63,000 documents in the form of text files (dataset of FIRE MSIR). Table 2 shows a sample of the dataset with various transliterated variations of non-English words, and a second sample of mixed-script data containing English words and transliterated Hindi words.

5. Experimental results

This section discusses the experiments along with the results and the consequent discussion.

5.1 Experimental results

The proposed algorithm for retrieving the language of a word in code-mixed data is evaluated using statistical measures and also using the machine learning approach. This section provides the evaluation based on the statistical model. We performed two separate experiments on the code-mixed data. First, to characterize the mixing patterns in the dataset, we computed two code-mixing metrics. The proposed system is analyzed and evaluated on the basis of the following code-mixing metrics.

MI: The Multilingual Index is a word-count based measure that quantifies the variation in the distribution of language tags in a corpus. Equation (4) defines the MI as:

$\displaystyle\textit{MI}=\frac{1-\sum_{j}P_{j}^{2}}{(k-1)\sum_{j}P_{j}^{2}}$ (4)

where $k$ denotes the number of languages and $P_{j}$ denotes the number of words in language $j$ divided by the number of words in the corpus. The value of MI lies between 0 and 1. A value of 0 corresponds to a monolingual corpus, and 1 corresponds to an equal number of tokens from each language in the corpus.
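Equation (4) can be checked numerically with a short function. This is a sketch for illustration; the function name `multilingual_index` and the dictionary input format are assumptions, and the token counts are made-up values rather than figures from the dataset.

```python
def multilingual_index(token_counts):
    """MI of Eq. (4); token_counts maps language tag -> number of tokens."""
    k = len(token_counts)
    total = sum(token_counts.values())
    if k < 2 or total == 0:
        return 0.0  # the index is only meaningful for two or more languages
    sum_sq = sum((count / total) ** 2 for count in token_counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)

print(multilingual_index({"H": 50, "E": 50}))   # 1.0
print(multilingual_index({"H": 100, "E": 0}))   # 0.0
```

The two calls reproduce the end points described above: an even split between two languages gives 1, and a corpus with all tokens in one language gives 0.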

CMI: The Code-Mixing Index is calculated at the utterance level by finding the most frequent language in the utterance and then counting the frequency of the words belonging to all other languages present. It is calculated using Eq. (5).

$\displaystyle\textit{CMI}=\frac{\sum_{i=1}^{N}(w_{i})-\max\{w_{i}\}}{n-u}$ (5)

where $\sum_{i=1}^{N}w_{i}$ is the number of words summed over the $N$ languages present in the utterance, $\max\{w_{i}\}$ is the largest number of words from any single language (more than one language can share the same maximum word count), $n$ denotes the total number of tokens, and $u$ denotes the number of tokens with language-independent tags. If an utterance contains only language-independent tokens (i.e. $n=u$ ), its index is defined to be zero. For other utterances, we normalize (multiply the value by 100) to obtain values in the range 0 to 100. Here the $w_{i}$ are the words tagged with each language and $\max\{w_{i}\}$ is the word count of the most prominent language. Applying this equation gives CMI $=$ 0 for monolingual utterances, because $\max\{w_{i}\}=n-u$ . Equation (5) is normalized as shown in Eq. (6).

$\displaystyle\textit{CMI}=\left\{\begin{array}{lll}100\times\left[1-\frac{\max\{w_{i}\}}{n-u}\right]&:&n>u\\ 0&:&n=u\end{array}\right.$ (6)

where $w_{i}$ is the number of words labeled with each language tag and $\max\{w_{i}\}$ is the number of words in the most prominent language. Applying the above equation gives CMI $=$ 0 for monolingual utterances, and a higher CMI value indicates a higher degree of language mixing.

Table 3

MI and CMI values

Language set	MI	CMI
Hindi-English	0.582	22.229

To understand the metric, consider the following scenario. Sentence S1 contains ten words: five words are from language L1 and the remaining five are from language L2. Applying Eq. (6), the CMI is 100 $\times(1-5/10)=$ 50. Another sentence, S2, also contains ten words, but each word is from a different language, so the CMI is 100 $\times(1-1/10)=$ 90. This correctly reflects that S2 is highly mixed, as every word belongs to a different language. Table 3 lists the MI and CMI values obtained for the corpus. These values help us understand the level of code mixing in the dataset before calculating the word-level and sentence-level similarity. The MI value reflects how the word counts are distributed across the languages, whereas the CMI value reflects the share of words that fall outside the most frequent language.
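The two scenarios above can be reproduced with a short Python sketch of Eq. (6). It assumes an utterance is represented as a list of per-token tags and that the set of language-independent tags is passed in explicitly (here only "Other", which is an illustrative choice); the function name is hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, independent=("Other",)):
    """CMI as in Eq. (6): 100 * (1 - max{w_i} / (n - u)) if n > u, else 0.

    `tags` holds one tag per token; tags listed in `independent` are
    treated as language independent (assumed convention).
    """
    n = len(tags)  # total number of tokens
    lang_counts = Counter(t for t in tags if t not in independent)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))


s1 = ["L1"] * 5 + ["L2"] * 5              # five words from each of two languages
s2 = [f"L{i}" for i in range(1, 11)]      # ten words, ten different languages
mono = ["E", "E", "E", "E", "Other"]      # one language plus a symbol

print(code_mixing_index(s1), code_mixing_index(s2), code_mixing_index(mono))
```

S1 and S2 give the values worked out in the text, and the monolingual utterance gives zero because its language-tagged tokens all belong to one language.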

Figure 5.

(a): word level similarity; (b): word level similarity.

Secondly, we computed the similarity score produced by the proposed algorithm on the dataset using Eq. (7). The score is used to label a word as either English or Hindi based on the frequency of the word. The proposed algorithm checks the Conf_Score of the classifier for language $L_{j}$ on input $W_{i}$, with 0 $\leqslant$ Conf_Score $\leqslant$ 1, where Conf_Score is the similarity metric $\text{sim}(W_{x},W_{y})$ and $x$ and $y$ can be any words in a string. A value of 1 indicates an exact match with the training set, and a value of 0 signifies a complete mismatch between word $x$ and word $y$. Values strictly between 0 and 1 indicate an approximate match, which is needed for Hindi roman words because a Hindi roman word can have many possible spelling variations. Such words are checked against the list condition before being classified as Hindi. Figures 5a and 6 show values in the ranges 0.89 to 1.0 and 0.903 to 1.0, respectively, for word-level and sentence-level similarity. This reflects the possibility of writing one Hindi word in more than one form; the value 1.0 corresponds to the standard form of that word in the dataset. The results obtained on the code-mixed dataset for the similarity score at word level and sentence level are as follows. Figure 5a and b show the results obtained at word level for Hindi roman transliterated words in the corpus. Figure 6 plots the similarity at the sentence level. Figure 7 describes the sentence level language identification based on the proposed design and algorithm discussed in Section 3.3.

$\displaystyle\text{Sim}(X,Y)=\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$ (7)
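As a small worked instance of Eq. (7), using illustrative vectors that are not drawn from the dataset, take $X=(1,2,0)$ and $Y=(2,1,0)$. Then

$\displaystyle\text{Sim}(X,Y)=\frac{1\cdot 2+2\cdot 1+0\cdot 0}{\sqrt{1^{2}+2^{2}+0^{2}}\,\sqrt{2^{2}+1^{2}+0^{2}}}=\frac{4}{\sqrt{5}\,\sqrt{5}}=0.8$

The result lies strictly between 0 and 1, so under the interpretation above it would count as an approximate rather than an exact match. For $Y=X$ the numerator and the denominator are both 5, which gives $\text{Sim}(X,X)=1$.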

Table 4

Description of the labels for Hindi-English dataset

Label	Description	Hindi-English %
E	English words only	57.76
H	Hindi words only	20.41
NE	Named Entity	6.59
Other	Symbols, Emoticons	14.8
Ambiguous	Can’t determine whether Hindi or English	0.27
Mixed	Word of Hindi English in combination	0.08
Unk	Unrecognized word	0.09

Figure 6.

Sentence level similarity.

Figure 7.

Visualization of word level language identification by the statistical model.

Next, we describe the experimental evaluation based on the BLSTM neural model. The dataset used for this work is obtained from the POS tagging task for Hindi-English code-mixed social media text conducted at ICON 2016 [25]. The dataset contains text from three social media platforms, namely Facebook, Twitter and WhatsApp. We use the Hindi-English dataset for the experimental evaluation. The labels used are summarized in Table 4.

The training data contains the tokens of the dataset with their corresponding language tags and POS tags.

E – indicating English words – example: This, and, there

H – indicating Hindi words – example: aisa, mera, tera

NE – indicating named entities like Person, Location and Organization – example: Narendra Modi, India, Facebook

Other – indicating tokens containing special characters and numbers – example: @, #, 0–9

Ambiguous – indicating words used ambiguously in Hindi and English – example: is, to, us

Mixed – indicating words of Hindi-English and number combination – example: MadamJi, Sirji

Unk – indicating unrecognized words – example: t.M, @.s, Ss

Table 5

Embedding dataset

Dataset	Number of sentences used for embedding (Facebook, Twitter and WhatsApp)
ICON 2016	2631
MSIR 2015	2700
MSIR 2016	6139

Table 6

F measure obtained for Twitter

Embedding type		E	H	NE
Character	1 gram	84.95	93.31	78.3
	3 gram	85.34	93.44	77.1
	5 gram	85.38	93.49	80.2
Word	1 gram	65.86	82.96	62.2
	3 gram	85.71	93.97	83.9
	5 gram	85.42	93.16	78.1

Table 7

F measure obtained for Facebook

Embedding type		E	H	NE
Character	1 gram	85.65	92.92	64.95
	3 gram	86.45	93.36	65.02
	5 gram	85.47	92.55	65.05
Word	1 gram	85.02	92.03	62.80
	3 gram	86.99	93.51	67.21
	5 gram	85.15	92.47	61.03

Table 8

F measure obtained for WhatsApp

Embedding type		E	H	NE
Character	1 gram	52.4	80.1	28.5
	3 gram	54.9	80.2	37.7
	5 gram	54.3	80.9	31.5
Word	1 gram	50.4	79.6	40.0
	3 gram	60.8	81.9	40.2
	5 gram	53.7	80.1	40.1

Figure 8.

F-score for label E, H, and NE.

All seven tags are present in the Facebook dataset, whereas only ‘E’, ‘H’, ‘NE’ and ‘Other’ are present in the Twitter and WhatsApp data. The size of the training and testing data is summarized in Table 4. From the table, it can be observed that the average number of tokens per comment in the WhatsApp training and testing data is much lower than in the Facebook and Twitter data. This may be because the Facebook and Twitter data mostly contain news articles and comments, which increases the average number of tokens per comment, while the WhatsApp data consists of short conversational messages.

To generate the embedding vectors, more data has to be provided so that the distributional similarity of the data can be captured effectively. The additional data collected is given to the embedding model along with the training data. The additional Hindi-English code-mixed data were collected from the Shared Task on Mixed Script Information Retrieval (MSIR), conducted in 2016 [26] and 2015 [27], and from the shared task on Code-Mix Entity Extraction conducted by the Forum for Information Retrieval and Evaluation (FIRE), 2016 [28]. Most of the data collected for embedding is Hindi-English code-mixed Twitter data. The size of the dataset used for embedding is given in Table 5.

Figure 9.

Visualization of word representation learned by the Bi-LSTM model for Hindi-English.

Figure 10.

Visualization of character representation learned by the Bi-LSTM model for Hindi-English.

Context appending was done for the training and test data of each of Facebook, Twitter and WhatsApp. The resulting data were given to the learning model for training and testing. The cross-validation accuracies obtained for Facebook, Twitter and WhatsApp with 1-gram, 3-gram and 5-gram context features, for both the character-based and the word-based embedding models, are presented below. Comparing the overall accuracy across the three platforms, the word-based model is more accurate than the character-based embedding model. In the word-based embedding model, 3-gram features give higher accuracy than the 1-gram and 5-gram context features, while in the character-based model 5-gram gives higher accuracy than 1-gram and 3-gram. Tables 6–8 show the performance on the Twitter, Facebook and WhatsApp Hindi-English code-mixed data, respectively; the best F-score for each of the language labels E (English), H (Hindi) and NE (Named Entity) is obtained with word embedding.
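One way to picture context appending is as a sliding window over the token sequence. The Python sketch below is an assumption about how such windows could be built, not a description of the exact scheme used in the experiments: it centers an n-token window on each token and pads the sentence edges with a hypothetical `<PAD>` symbol. The example sentence is illustrative.

```python
def append_context(tokens, n, pad="<PAD>"):
    """Return one n-token window per token, centered on that token.

    n = 1 keeps the token alone, n = 3 adds one neighbour on each side,
    n = 5 adds two. Sentence edges are filled with `pad` (an assumption).
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    half = n // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + n] for i in range(len(tokens))]


# Illustrative Hindi-English sentence, not taken from the dataset.
sentence = ["yeh", "movie", "bahut", "achhi", "hai"]

for n in (1, 3, 5):
    # Show the window built around the middle token "bahut".
    print(n, append_context(sentence, n)[2])
```

Each window would then be mapped to its embedding vectors and paired with the label of its center token before being passed to the learning model.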

From the results tabulated in Tables 6–8, it is clearly seen that the 3-gram word embedding model gives a better score than the other models. Table 6 holds the label-wise F-measure for Twitter data, Table 7 for Facebook data and Table 8 for WhatsApp data. In all three tables, the 3-gram word embedding model gives the highest F-measure for each label, ahead of the 1-gram and 5-gram word embeddings and of every character embedding model. For the character-based model, the best context size varies: the 5-gram model is best for every label on Twitter, whereas on Facebook and WhatsApp the 3-gram and 5-gram models each lead on some labels. Where the 5-gram character model leads, a likely reason is that the system needs more context information to identify the language. Fig. 8 shows the F-scores obtained for Facebook, Twitter and WhatsApp for the labels E (English), H (Hindi) and NE (Named Entity).

We visualize the representations learned by the RNN model through the word embeddings of a selected subset of words from the datasets. The plots map the labels to colors according to the seven labels defined in Table 4. The color encoding is as follows: red for label E, blue for label H, black for label NE, orange for label Other, purple for labels Ambiguous and Mixed, and yellow for label Unk (unrecognized word).

Figures 9 and 10 give a visual representation of the model trained with word-level embedding and character-level embedding, respectively, in the code-mixed environment. The results are more promising for word-level embedding than for character-level embedding. The proposed neural model gives a clear separation between the different labels defined in Table 4, and in particular between Hindi and English in the code-mixed dataset. This result suggests that the model can be scaled to detect other languages present in code-mixed and code-switched data without any additional feature engineering.

6. Conclusions

The intricacy of language identification in code-mixed and code-switched data is governed by the following parameters: data source, code switching, code mixing, and the relation between the languages involved. Our evaluation and experiments indicate that code mixing is more commonly used in the social media context. In this work, code-mixing metrics help in identifying the code-mixing patterns across language pairs. By analyzing the code-mixing metrics, we conclude that Hindi and English words are often mixed in our dataset. It would be worthwhile to investigate the emerging trend of code switching and code mixing in order to draw conclusions about the behavioral patterns in data from different sources, such as song lyrics, chat data with different language sets, blog data and scripts of plays or movies. We have implemented two different evaluation models, a statistical model and a neural learning model, and obtained competitive results for the identification of languages. This is probably due to the amount of training and testing data we have. The results show that word embeddings are capable of detecting the language separation by identifying the origin of a word and mapping it to its language label. The BLSTM system performs better for the HIN-ENG language pair. The model captures long-distance dependencies in a sequence, which is in line with the observation made above that word-level language identification in code-mixed data benefits from the context of the word. Scaling this system to identify other characteristics of code-mixed data involving a blend of different languages is a potential future direction to explore.

References

1. Weischedel, Carbonell, Grosz et al., White paper on natural language processing, in: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 1989, pp. 481–493.

2. Kim, Convolutional neural networks for sentence classification, arXiv preprint (2014).

3. Barman, Das et al., Code mixing: A challenge for language identification in the language of social media, in: Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014, pp. 13–23.

4. King, Baucom et al., The IUCL+ system: Word-level language identification via extended Markov models, EMNLP (2014), 102–106.

5. King and Abney, Labeling the languages of words in mixed-language documents using weakly supervised methods, in: Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 1110–1119.

6. Nguyen and Dugruoz A.S., Word level language identification in online multilingual communication, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 857–862.

7. Yogarshi, Gella, Sharma, Bali and Choudhury, POS tagging of English-Hindi code-mixed social media content, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 974–979.

8. Das and Gamback, Identifying languages at the word level in code-mixed Indian social media text, in: Proceedings of the 11th International Conference on Natural Language Processing, 2014, pp. 378–387.

9. Sequiera, Choudhury, Gupta et al., Overview of FIRE-2015 shared task on mixed script information retrieval, in: FIRE Workshops, 2015, pp. 19–25.

10. Jhamtani, Bhogi S.K. et al., Word-level language identification in bi-lingual code-switched texts, in: Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, 2014, pp. 348–357.

11. Ethiraj, Shanmugam, Srinivasa and Sinha, NELIS – Named Entity and Language Identification System: Shared task system description, in: FIRE Workshops, 2015, pp. 43–46.

12. Bhargava, Sharma and Sharma, Sentiment analysis for mixed script indic sentences, in: International Conference on Advances in Computing, Communications and Informatics, ICACCI, 2016, pp. 524–529.

13. Castilho, Eckart et al., Cross-platform text mining and natural language processing interoperability, in: Proceedings of the LREC, 2016.

14. Barman, Wagner and Foster, Part-of-speech tagging of code-mixed social media content: Pipeline, stacking and joint modelling, in: Proceedings of the Second Workshop on Computational Approaches to Code Switching EMNLP, 2016, pp. 30–39.

15. Bali, Jatin and Choudhury, “i am borrowing ya mixing?” An analysis of English-Hindi code mixing in Facebook, in: Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014, pp. 116–126.

16. Vyas, Gella, Sharma, Bali and Choudhury, POS tagging of English-Hindi code-mixed social media content, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 974–979.

17. Rao P.R. and Devi S.L., Code mix entity extraction in Indian languages from social media text@FIRE, in: FIRE (Working Notes), 2016, pp. 289–295.

18. Devi, Veena, Anand Kumar P.V. et al., AMRITA-CEN@FIRE 2016: Code-mix entity extraction for Hindi-English and Tamil-English tweets, in: CEUR Workshop Proceedings, 2016, pp. 304–308.

19. Sapkal and Shrawankar, Transliteration of secured SMS to Indian regional language, Procedia Computer Science (2016), 748–755.

20. Zubiaga, Vicente I.S., Gamallo and Pichel J.R., TweetLID: A benchmark for tweet language identification, Language Resources and Evaluation (2015), 729–766.

21. Szegedy, Liu, Jia et al., Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

22. Yann, Haffner, Bottou et al., Object recognition with gradient-based learning, in: Shape, Contour and Grouping in Computer Vision, 1999, pp. 319–345.

23. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation (1997), 1735–1780.

24. Mikolov, Sutskever, Chen, Corrado G.S. and Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

25. Jamatia and Das, Task report: Tool contest on POS tagging for code-mixed Indian social media (Facebook, Twitter, and Whatsapp) text, in: Proceedings of ICON, 2016.

26. Banerjee, Chakma, Naskar et al., Overview of the Mixed Script Information Retrieval (MSIR), in: CEUR Workshop Proceedings, 2016, pp. 94–99.

27. Sequiera, Choudhury, Gupta et al., Overview of FIRE-2015 shared task on mixed script information retrieval, in: FIRE Workshops, 2015, pp. 19–25.

28. Srinidhi, Singh, Devi et al., Context based character embeddings for entity extraction in code-mixed text, in: CEUR Workshop Proceedings, 2016, pp. 321–324.

29. Alekseev and Nikolenko, Word embeddings for user profiling in online social networks, Computación y Sistemas (2017), 203–226.

30. Shekhar, Sharma D.K. and Beg M.S., Hindi roman linguistic framework for retrieving transliteration variants using bootstrapping, Procedia Computer Science (2018), 59–67.

31. Veena P.V., Kumar and Soman K.P., Character embedding for language identification in Hindi-English code-mixed social media text, Computación y Sistemas (2018), 65–74.

Proposed algorithm for language identification
Input: Code-mixed data
Output: Language of the input word
Algorithm steps
1.	Input term from test document
	Let $D=W_{1},W_{2},\ldots,W_{n}$ be a document
	where the $W_{i}$ are the words
2.	Letters of the words {a-z or A-Z}
3.	// word2vec level tagging
	3.1 $L_{b}(W_{i})$ chosen from language set $L$
	where $L=\{L_{\text{E}},L_{\text{H}},L_{\text{O}}\}$
	// $L_{\text{E}}$ – English
	// $L_{\text{H}}$ – Hindi Roman
	// $L_{\text{O}}$ – Other
	// Check the frequencies of characters in $W_{i}$
	3.2 Generate vectors for the characters of $W_{i}$
	3.3 Apply similarity metric
	$\text{Sim}(X,Y)=\frac{\sum_{i=1}^{n}X_{i}Y_{i}}{\sqrt{\sum_{i=1}^{n}X_{i}^{2}}\sqrt{\sum_{i=1}^{n}Y_{i}^{2}}}$
4.	Label the word E (English) or H (Hindi).
5.	Check the Conf_Score of the classifier for language $L_{j}$ on input $W_{i}$, with $0\leqslant\textit{Conf\_Score}\leqslant 1$
	where Conf_Score is the similarity metric
	Sim($W_{x},W_{y}$); $x$ and $y$ can be words in a string
	Sim($x,y$) $\in$ [0, 1] for normalization
	Sim($x,y$) $=$ 1: exact match
	Sim($x,y$) $=$ 0: completely different $x$ and $y$
	0 $<\text{Sim}(x,y)<$ 1: approximate similarity
	Threshold value $=$ 1 for exact match
	Sim $=$ 1 matches $L_{\text{E}}$; Sim $<$ 1 matches $L_{\text{H}}$ or $L_{\text{O}}$, based on the list condition for $L_{\text{O}}$ words
	If $L_{\text{E}}$ matches an $L_{\text{O}}$ word
	$L=L_{\text{O}}$
6.	Classify the word as E, H or O
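The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than the implementation evaluated in the paper: words are represented by character-frequency vectors, the English and Hindi roman lexicons are tiny hypothetical sets, Sim = 1 is treated as exact lexicon membership (a bag-of-characters cosine alone would also score anagrams as 1), and the approximate-match threshold of 0.89 is borrowed from the lower end of the word-level range reported for Figure 5a.

```python
import math
from collections import Counter


def char_vector(word):
    """Character-frequency vector of a word (letters only, lower-cased)."""
    return Counter(ch for ch in word.lower() if ch.isalpha())


def cosine(x, y):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(x[c] * y[c] for c in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def best_similarity(word, lexicon):
    """Highest similarity between `word` and any entry of `lexicon`."""
    vec = char_vector(word)
    return max((cosine(vec, char_vector(w)) for w in lexicon), default=0.0)


def identify_language(word, english, hindi, approx_threshold=0.89):
    """Classify a word as 'E', 'H' or 'O' (threshold is an assumed value)."""
    if not any(ch.isalpha() for ch in word):
        return "O"  # symbols, digits, emoticons
    w = word.lower()
    if w in english:
        return "E"  # exact match, Sim = 1
    if w in hindi or best_similarity(w, hindi) >= approx_threshold:
        return "H"  # exact or approximate match with a Hindi roman form
    return "O"


# Hypothetical lexicons for illustration only.
english = {"this", "and", "there", "movie"}
hindi = {"aisa", "mera", "tera", "nahi"}

for token in ["this", "nahin", "mera", "@", "xyz"]:
    print(token, identify_language(token, english, hindi))
```

The spelling variant "nahin" is absent from the toy Hindi lexicon but scores about 0.94 against "nahi", so it clears the assumed threshold and is labeled H, which mirrors the role of approximate matching for transliteration variants.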