Abstract
The task of author profiling aims to infer an author’s profile traits from a given text. It has potential applications in marketing, forensic analysis, fake profile detection, etc. In recent years, the use of bi-lingual text has risen with the global reach of social media tools, as people prefer to use the language that best expresses their true feelings during online conversations and assessments. This has likewise increased the use of bi-lingual (English and Roman-Urdu) text on social media in the sub-continent (Pakistan, India, and Bangladesh). To develop and evaluate methods for bi-lingual author profiling, benchmark corpora are needed. The majority of previous efforts have focused on developing mono-lingual author profiling corpora for English and other languages. To fill this gap, this study explores the problem of author profiling on bi-lingual data and presents a benchmark corpus of bi-lingual (English and Roman-Urdu) tweets. Our proposed corpus contains 339 author profiles, and each profile is annotated with six different traits: age, gender, education level, province, language, and political party. As a secondary contribution, a range of deep learning methods (CNN, LSTM, Bi-LSTM, and GRU) are applied and compared on three different bi-lingual corpora, including our proposed corpus, for age and gender identification. Our extensive experimentation showed that the best results for both the gender identification task (Accuracy = 0.882, F1-Measure = 0.839) and the age identification task (Accuracy = 0.735, F1-Measure = 0.739) are obtained using the Bi-LSTM deep learning method. Our proposed bi-lingual tweets corpus is free and publicly available for research purposes.
Introduction
The identification of personality and demographic traits from written text is termed author profiling [19]. These traits include age, gender, language, region, etc. Its implications for forensics, marketing, and content recommendation have made author profiling a popular research area in Natural Language Processing. In linguistic forensics, it can be used to establish the originality of an author’s text and to verify authorship attributes. Similarly, in marketing and content recommendation, it can help to identify the interests of users, recommend advertisements and suggest materials to users based on blogs and reviews of various products [29].
The English language has been the major interest of researchers in author profiling, especially gender identification [29], but with the ever-increasing usage of social media, the focus has shifted from the formal style of writing to more informal and unstructured writing styles [9]. This has encouraged the use of local languages for communication between users of the same origin. Recently, there has been a lot of work on the development of author profiling corpora in languages other than English. This includes the author profiling of English and Arabic emails [8, 9] and the creation of a Twitter-based corpus for author profiling that covers six European languages, i.e. Italian, Spanish, French, German, Portuguese and Dutch [5]. PAN’s author profiling corpora are mono-lingual and focus on different languages (English, Italian, Arabic, Spanish, Dutch and Portuguese) and genres (tweets, hotel reviews, blogs, social media) [22–27].
Smartphones and gadgets have played a vital role in bringing individuals from various societies closer together. This has encouraged people to express their remarks or assessments in one or more languages on similar occasions or items by means of social media tools. Thus, bi-lingual remarks on the same subject have become more common. Making adequate use of the bi-lingual information on social media is challenging because of the absence of dependable and precise tools and techniques for analyzing bi-lingual data. This has opened a new set of opportunities and challenges for researchers. For example, Abbasi et al. [1] used bi-lingual data in English and Arabic for authorship identification. Yan et al. [30] developed a bi-lingual corpus (Chinese and English) for social media sentiment analysis. Al-Rowaily et al. [4] used bi-lingual (Arabic and English) text in the cyber security domain by developing a sentiment analysis lexicon for analyzing dark web forums.
Urdu is the national language of Pakistan and is written in the Perso-Arabic script. Urdu script support is lacking on various social media websites and smart devices. Hence, Urdu is often written as Roman-Urdu, i.e. Urdu words written using the English alphabet [14]. Roman-Urdu is gaining attention in research, and researchers have developed Roman-Urdu corpora for different tasks. Fatima et al. [11] developed a Roman-Urdu corpus for bi-lingual author profiling on Facebook data. Fatima et al. [10] also generated an SMS-based corpus for gender and age identification. Afzal et al. [3] collected English and Roman-Urdu tweets for spam detection. Similarly, Javaid et al. [14] generated a bi-lingual (English and Roman-Urdu) corpus for sentiment analysis. There is still a lot to be done regarding author profiling on bi-lingual data, as the existing bi-lingual corpora either focus on genres other than Twitter or were developed for tasks other than author profiling.
This research aims at the development of a bi-lingual Tweets-based Author Profiling Corpus (hereafter called BT-AP-19 Corpus) that contains tweets written in English and Roman-Urdu. The corpus was generated by collecting users’ demographic information (age, gender, education level, province, language, and political party) along with their Twitter ids. These Twitter ids were then used to extract 41,864 tweets from 339 different profiles using the Twitter API. Moreover, we applied four deep learning based methods to evaluate our proposed corpus for gender and age identification: Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), Bidirectional Long Short Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU). Furthermore, we compared the performance of these deep learning methods on two existing bi-lingual (English and Roman-Urdu) corpora: the Facebook-based RUEN-AP-17 Corpus [11] and the SMS-based SMS-AP-18 Corpus [10].
We believe that the BT-AP-19 Corpus will be useful in: (1) making a direct comparison of existing author profiling techniques on bi-lingual tweets-based author profiles, (2) developing and evaluating new techniques for bi-lingual author profiling, (3) fostering and boosting research in a low-resourced language, i.e. Roman-Urdu, (4) creating a bi-lingual dictionary for the English and Roman-Urdu language pair and (5) developing and evaluating techniques for handling spelling variations in Roman-Urdu text.
The rest of this paper is organized as follows: Section 2 reviews the literature on existing corpora. Section 3 explains the corpus creation process. Section 4 presents models for author profiling. Section 5 describes the experimental setup. Section 6 focuses on the empirical results and their analysis. Lastly, Section 7 concludes the paper with future research directions.
Literature review
Recently, multiple benchmark corpora have been made available for author profiling tasks. One of the most prominent efforts has been the PAN Author Profiling Competitions, held annually since 2013 [22–27]. The primary contribution of these competitions has been benchmark author profiling corpora. PAN Author Profiling Competitions have focused primarily on two traits, age and gender, except for the competitions held in 2015 and 2017, which included the prediction of personality traits and native language identification, respectively. In all the competitions, the gender trait had two classes, i.e. male and female, but the age classes have varied. The PAN-2013 competition had three age classes: 10s (13–17), 20s (23–27) and 30s (33–47). The PAN-2014 and PAN-2016 competitions had five age classes: 18–24, 25–34, 35–49, 50–64 and 65+, whereas the PAN-2015 competition had four age classes: 18–24, 25–34, 35–49 and 50-xx.
PAN has explored various genres and languages for corpus generation. These genres include social media, Twitter, blogs and hotel reviews. Among all these genres, the majority of corpora are developed using tweets. Tweets-based corpora have been used in the PAN-2014, PAN-2015, PAN-2016, PAN-2017, and PAN-2018 Author Profiling Competitions. In the PAN-2014 competition [26], the English training corpus consisted of 306 Twitter profiles whereas the Spanish training corpus consisted of 178 Twitter profiles. The PAN-2015 competition [25] corpus was made up of four tweets-based sub-corpora in four European languages, i.e. English, Spanish, Italian and Dutch. The training data consisted of 152, 110, 38 and 34 profiles for English, Spanish, Italian, and Dutch, respectively. In the PAN-2016 competition [27], only the training corpus was completely Twitter-based; it consisted of 428 English, 250 Spanish and 384 Dutch profiles. In the PAN-2017 competition [24], 500 profiles were collected for each language variety of English, Spanish, Arabic, and Portuguese for training purposes. In the PAN-2018 competition [22], both text and images were part of the corpus. The text corpus consisted of 1500 Arabic profiles and 3000 profiles each for English and Spanish. As can be noted, all the PAN Author Profiling Corpora are mono-lingual and the majority of them are tweets-based, highlighting the focus of the research community on this genre. In addition, most of the tweets-based corpora are not very large. This shows that constructing benchmark corpora from tweets is a non-trivial task.
There have been other efforts in the literature to develop mono-lingual author profiling corpora using tweets. Rao et al. [28] classified latent user attributes using tweets. They collected English tweets from 2200 Twitter users to predict gender, age, origin, and political interests. They applied the ’focused search’ approach using the Twitter API, followed by manual annotation, to build the corpus. The corpus contained 1000 profiles for age prediction, 500 each for gender and regional origin, and 200 profiles for political orientation. Similarly, Burger et al. [6] used the Twitter API to randomly collect tweets in thirteen different languages for the task of gender identification. Their corpus consisted of 184,000 Twitter profiles, of which 45% were male and 55% were female. Verhoeven et al. [5] developed a manually annotated, tweets-based corpus for gender and personality identification. Their corpus covered the Spanish, French, Dutch, Portuguese, German and Italian languages and consisted of 411 German, 490 Italian, 1000 Dutch, 1405 French, 4090 Portuguese and 10,777 Spanish profiles. The majority of the profiles in each language group were female.
In previous studies, efforts have also been made to develop bi-lingual (English and Roman-Urdu) corpora. Fatima et al. [10] developed a multilingual (English and Roman-Urdu) SMS corpus for the identification of age and gender. They collected 84,694 SMS messages from 810 profiles. These profiles consisted of 610 male users and 200 female users, who were mainly students aged 15-25 years. Fatima et al. [11] collected posts from 479 Facebook users for bi-lingual author profiling tasks. Profiles were divided by gender and into three age groups, i.e. users below 19 years, 20-24 years and above 25 years. The primary contributors to the corpus were again young people below 25 years of age. Apart from author profiling, Afzal et al. [3] collected a corpus of 2000 English and Roman-Urdu tweets from five major cities of Pakistan using the Twitter API for spam detection. Similarly, Javaid et al. [15] collected 89,000 (English = 82,224 and Roman-Urdu = 6,847) tweets for a sentiment analysis task.
To conclude, the majority of existing corpora are mono-lingual (developed for English and other European languages). Some efforts have focused on developing bi-lingual (English and Roman-Urdu) author profiling corpora but they focus on Facebook posts/comments and SMS messages genres. This study contributes a benchmark bi-lingual (English and Roman-Urdu) tweets-based author profiling corpus. As far as we are aware, no such corpus has been developed previously.
Corpus generation process
This section presents the process followed in the creation of our proposed BT-AP-19 Corpus including challenges in corpus generation, user selection and data extraction, and corpus characteristics and standardization.
Corpus generation challenges
Twitter is one of the renowned social networking websites for micro-blogging. Each user is allowed to write a maximum of 140 characters in a single micro-blog post called a tweet. Twitter allows its users to keep their profiles public or protected (by default, all profiles and tweets are public). Twitter allows researchers and developers to extract public tweets through the Twitter API, using either a Twitter id or keywords.
Data collection for author profiling using tweets is a challenging task. Firstly, Twitter records only basic information about users, such as user id, language, date of birth and country. All of these, except the user id, are optional and can be changed at any time. Thus, the demographic attributes we are interested in, such as age, gender and qualification, cannot be obtained reliably from Twitter itself. Secondly, as per Twitter policy, only public tweets can be viewed and collected using the Twitter API. Collecting protected tweets requires the user’s credentials, which poses a privacy threat.
Due to the above-mentioned challenges, we are left with two possible options: (1) a focused search of Twitter profiles followed by manual annotation of data, as suggested by Rao et al. [28], and (2) requesting Twitter users to provide their Twitter ids along with their true demographic information (the Twitter API can then be used to fetch tweets using the Twitter ids collected from participants). Using the former approach, a large number of user profiles can be collected, but it has some serious limitations: (i) we are likely to collect incomplete or incorrect demographic information, which is unacceptable when constructing a gold standard author profiling corpus, and (ii) focused search can also induce a bias, since user selection may favor people who explicitly state their demographic information [20]. Because of these problems with the focused search approach, we opted for the latter approach to build our proposed corpus. Although this required greater effort for data collection, the collected data was of high quality and realistic, and the demographic information was correct (since the users themselves provided it).
User selection and data extraction
Our approach to collecting Twitter user profiles is divided into two steps: (1) manual collection of Twitter ids along with the demographic information and (2) use of the Twitter API to collect the tweets of users who provided their ids in the first step. For the first step, a Google Form was created, and for the second step, a software tool was developed that used the Twitter API for tweet collection.
In data collection, the main goal was to collect data from a variety of users while ensuring the correctness of demographic information and minimizing user selection bias, which can be induced either by focused search [20] or by collecting demographics from what users explicitly state on social media websites [6]. For this purpose, a Google Form was designed to collect each user’s demographic information. The Form required the users to provide their Twitter id along with their demographic traits, including (1) age, (2) gender, (3) occupation, (4) education level, (5) native province, (6) native language and (7) preferred political party. The Form was shared among Twitter users by two means: (1) online, by sharing its link through emails and tweets, and (2) in print, by locally distributing a printed version of the same Google Form. Responses to both the printed and online survey forms were then saved in electronic form in spreadsheets. This author trait information assists in developing, evaluating, analyzing and comparing methods for author profiling.
The Twitter API allows the tweets of only a single user to be fetched at a time, so fetching tweets from a large number of profiles was a tedious task. To overcome this issue, a software tool was developed to automatically collect the tweets of all respondents using the Twitter API. The survey responses recorded in the spreadsheet file were used as input to the software. Each user’s tweets were extracted sequentially and stored in a separate text file. Re-tweets were removed from the data because they do not reflect the author’s own textual content [17]. All usernames mentioned in tweets were replaced by a common placeholder (@username) in the text files for anonymity.
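The two cleaning steps described above (dropping re-tweets and anonymizing mentions) can be sketched as follows. This is a minimal illustration and not the software used to build the corpus: the `is_retweet` heuristic (text starting with "RT @"), the regular expression and the function names are assumptions made for the example.

```python
import re

# Matches Twitter-style handles such as @some_user (letters, digits, underscore).
MENTION_RE = re.compile(r"@\w+")


def is_retweet(text: str) -> bool:
    """Heuristic (assumption): classic re-tweets start with 'RT @'."""
    return text.lstrip().startswith("RT @")


def clean_tweets(tweets):
    """Drop re-tweets and replace every user mention with '@username'."""
    cleaned = []
    for text in tweets:
        if is_retweet(text):
            continue
        cleaned.append(MENTION_RE.sub("@username", text))
    return cleaned


sample = [
    "RT @someone: breaking news",
    "@ali_92 kal milte hain, see you at 5",
    "Back to Linux...",
]
print(clean_tweets(sample))
```

A production version would more likely rely on the re-tweet flag returned with each tweet by the API than on a text prefix.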
The total number of profiles collected was 550, which included responses from both the printed and online survey forms. The profiles were then scrutinized for validity, duplication and a minimum amount of tweet text. Each individual Twitter id was verified for correctness, and 27 entries were found to have an incorrect Twitter id. Since the Google Form allows users to submit multiple responses, 11 of the remaining entries were duplicates, which were removed. Further, each individual text file was checked against a minimum-length threshold. A limit of 140 characters was defined as the minimum threshold because it is possible to identify authorship attributes from a single 140-character tweet [18]. This check revealed 173 profiles with fewer than 140 characters in their text files, i.e. less than the maximum size of one tweet. These profiles and their data were discarded. They mainly belonged to inactive Twitter users and people who never or very rarely tweet. In total, 211 profiles (27 + 11 + 173) were discarded, leaving 339 valid profiles.
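The screening described above amounts to three sequential filters. The sketch below shows one way to express them, followed by the bookkeeping for the reported counts. The record fields (`twitter_id`, `id_is_valid`, `text`) are hypothetical names chosen for illustration, not the authors’ data format.

```python
def validate_profiles(profiles, min_chars=140):
    """Keep profiles that pass three checks, in order: valid Twitter id,
    not a duplicate submission, and at least `min_chars` characters of text."""
    seen_ids = set()
    kept = []
    for p in profiles:
        if not p["id_is_valid"]:
            continue  # incorrect Twitter id
        if p["twitter_id"] in seen_ids:
            continue  # duplicate form submission
        seen_ids.add(p["twitter_id"])
        if len(p["text"]) < min_chars:
            continue  # fewer than 140 characters of tweet text
        kept.append(p)
    return kept


# Bookkeeping for the counts reported in the text.
collected, bad_id, duplicates, too_short = 550, 27, 11, 173
discarded = bad_id + duplicates + too_short
print(discarded, collected - discarded)  # 211 339
```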
Corpus characteristics
Our proposed BT-AP-19 Corpus contains 339 author profiles. The total number of tweets in the corpus is 41,864 and the average number of tweets in an author profile is 124.
Table 1 shows the five most frequent English words, Roman-Urdu words and common-words extracted from the BT-AP-19 Corpus. As can be noted, English words tend to have more instances than Roman-Urdu words. This indicates that people use more English words than Roman-Urdu words when writing tweets, although there is still considerable usage of Roman-Urdu words. One reason for the lower Roman-Urdu counts is that a few Roman-Urdu words have the same spelling as English words, although their pronunciation and meaning are different. These words are referred to as common-words; for example, ′to′ is sometimes used for an Urdu word pronounced ’tou’, which means ′so′. Another reason for the greater number of English word instances is that most people tend to write sentences that combine Urdu and English while tweeting. For example, one of the users in the BT-AP-19 Corpus tweeted: “Back to Linux... Ruby Sirf tmharay liay;) ” meaning “Back to Linux... Ruby(Ruby on Rails) only for you;) ”.
Top five words in BT-AP-19 Corpus
The corpus consists of two gender classes, four age classes, two qualification classes, three provincial classes, five native language classes, and four political party classes. Since the data collection was random, the profiles are not balanced across the classes of any trait. Table 2 provides the details of all traits, the names of the classes and the number of profile instances in each class. There are 205 female and 134 male profiles, which suggests that women were more open to sharing their profiles than men. The four age classes are 18-24 years, 25-34 years, 35-49 years and 50-xx years. Since most Twitter users are young people, the highest contribution is from the 18-24 age group (139 profiles), but there is a substantial number of contributions from the 25-34 (88 profiles) and 35-49 (80 profiles) age groups, whereas the age group of 50 and above made the smallest contribution with 32 profiles. Qualification is divided into two groups: Group 1 consists of qualifications from 12th grade to 16th grade, also termed graduation-level qualification, and Group 2 consists of education levels from 16th grade onwards, commonly known as postgraduate-level qualification. Group 1 and Group 2 have 246 and 93 profiles, respectively. The province trait reveals that most of the participants are from the two most populous provinces of Pakistan, Punjab and Sindh. Users belonging to all other provinces have been grouped as others. The language trait has five classes, i.e. English, Urdu, Punjabi, Sindhi, and all other languages. The distribution of profiles over languages shows that the majority of the responses were collected from users who speak English, Urdu or both. The contribution from the rest of the languages is very low. The last trait captures the political affiliations of the users. The three major political parties that attracted responses from users are Pakistan Muslim League - Nawaz (PML(N)), Pakistan People’s Party (PPP) and Pakistan Tehreek-e-Insaf (PTI). The largest share of users selected others, which received 154 responses. This suggests that many users refrained from declaring their political affiliation.
Distribution of profiles for different traits in BT-AP-19 Corpus
Deep learning builds on the concept of the neural network, in which neurons, loosely modeled on the human brain, work together in several layers to perform a specific task. The deep learning models used in this study include Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), Bidirectional Long Short Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU). This section describes these models in detail.
Word embeddings
Words and sequences of words are the raw features contained in the corpus. Each word is treated as a single unit, and the model attempts to learn a vector representation for it according to its context. With word embeddings, similar words are given similar vector representations, which greatly reduces the dimensionality of the input relative to the vocabulary size without substantial loss of information. Word embeddings are widely used to solve text classification problems [2].
We performed our experiments using GloVe word embeddings [21]. GloVe is an unsupervised learning model that derives vector representations from word co-occurrence statistics. It collects the co-occurrence counts of words in matrix form, searching for context words within a specified window around each term. Close-by words are given higher weights than distant words. The resulting vectors, called word embeddings, exhibit linear substructures in the vector space.
For experimentation, we used the 100-dimensional Twitter-based GloVe pre-trained embeddings, trained on 2 billion tweets containing 27 billion tokens, with a vocabulary of 1.2 million words.
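To illustrate how pre-trained vectors of this kind are typically turned into the weights of an embedding layer, the following sketch parses the GloVe text format (one token per line followed by its values) and builds an embedding matrix. The function names, the padding row and the random initialization of out-of-vocabulary words are assumptions of this example; the source does not state how unknown words were handled.

```python
import random


def parse_glove(lines):
    """Parse GloVe text format: one token per line followed by its floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Row i + 1 holds the vector of vocab[i]; row 0 is reserved for padding.

    Words missing from GloVe (likely for Roman-Urdu spelling variants) get
    small random vectors. That fallback is an assumption of this sketch.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim]  # padding row
    for word in vocab:
        vec = vectors.get(word)
        if vec is None:
            vec = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix.append(vec)
    return matrix


# Toy 3-dimensional stand-in for the real 100-dimensional file.
toy_file = ["back 0.1 0.2 0.3", "to 0.4 0.5 0.6"]
vectors = parse_glove(toy_file)
matrix = build_embedding_matrix(["back", "to", "tmharay"], vectors, dim=3)
print(len(matrix), len(matrix[0]))  # 4 3
```

With the real file, `parse_glove` would be called on an open file handle instead of the toy list, and `dim` would be 100.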
Convolutional neural network
Convolutional Neural Network (CNN) [16] is a widely used feed-forward neural network model. In a CNN, information travels only in the forward direction without being retained. It comprises three principal layers, i.e. an Input Layer, a Convolutional Layer with pooling and an Output Layer. The information, in the form of embeddings, is given to the Input Layer, which passes it to the Convolutional Layer. The Convolutional Layer slides a window of size "k" with a set of filters over the embeddings to compute fixed-length convolutions, which are also termed a feature set. The same sliding window and filters are applied over every sentence. The convolutions are followed by the ReLU activation function, which converts negative values to zero. A pooling layer is then applied to every convolution to reduce dimensionality and computational complexity. The same process is repeated for all the hidden layers. Next, the output is forwarded to a fully connected layer that generates a feature map. A softmax layer is then applied over the fully connected layer for classification.
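The core operations of the Convolutional Layer described above (a sliding window of size k, ReLU and max-pooling) can be sketched for a single filter in plain Python. This is an illustrative toy rather than the authors’ implementation; in particular, pooling over the whole feature map is an assumption, since the source only says that max-pooling is used.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of embedding vectors ('valid' mode).

    sequence: list of T vectors, each of size d
    kernel:   list of k vectors, each of size d (window size k)
    Returns T - k + 1 activations, one per window position.
    """
    k = len(kernel)
    out = []
    for start in range(len(sequence) - k + 1):
        total = bias
        for offset in range(k):
            for x, w in zip(sequence[start + offset], kernel[offset]):
                total += x * w
        out.append(total)
    return out


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Keep only the strongest response of the filter."""
    return max(values)


# Four tokens with 2-dimensional embeddings, one filter with window size 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
kernel = [[1.0, 1.0], [1.0, 1.0]]
feature_map = relu(conv1d(tokens, kernel))
print(feature_map, global_max_pool(feature_map))
```

A real layer applies many such filters in parallel (128 in the configuration reported later) and learns the kernel values during training.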
Long short term memory
Long Short Term Memory (LSTM) [13] is a well-known adaptation of the Recurrent Neural Network (RNN). An RNN differs from a CNN in that it retains previous information in a memory cell for the processing of subsequent information. An RNN works well with short sequences but starts to lose information over longer sequences, which is known as the vanishing gradient problem. LSTM addresses this problem and is equipped to learn from both short and long sequences. LSTM has a repeating memory module structure in which four neural network layers work together to process and store relevant information in the memory cell. The architecture is often termed a gated structure, consisting of a cell state and three gates: the input gate, the forget gate and the output gate. The cell state acts as the memory cell and retains the information. The forget gate works with the cell state to decide which information to hold and discards the rest. The input gate decides which information needs to be updated, creates the new candidate information and updates the cell state with it. The updated cell state is then passed through the output gate, which filters the information to decide which part of it to let through as output.
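The gate interactions described above can be made concrete with a single LSTM time step written in plain Python. This is a minimal sketch of the standard LSTM formulation, not code from the study; the parameter layout (`params` mapping each gate name to a weight matrix and bias acting on the concatenation of the previous hidden state and the input) is an assumption of the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params[name] = (W, b) acting on [h_prev, x]."""
    z = h_prev + x  # list concatenation
    f = [sigmoid(a) for a in affine(*params["forget"], z)]        # what to keep
    i = [sigmoid(a) for a in affine(*params["input"], z)]         # what to write
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]   # new content
    o = [sigmoid(a) for a in affine(*params["output"], z)]        # what to expose
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Hidden size 2, input size 1, all-zero weights: every gate outputs 0.5,
# so the cell state is simply halved.
zeros = ([[0.0] * 3, [0.0] * 3], [0.0, 0.0])
params = {name: zeros for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
print(c)  # [0.5, -0.5]
```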
Bidirectional long short term memory
LSTM has the ability to retain short- and long-term information, but it cannot capture the complete structure of a sentence because it considers only the preceding information. It has no mechanism for taking succeeding information into consideration. Bidirectional Long Short Term Memory (Bi-LSTM) [12] works in both the left and right directions. It considers the previous, current and next information while making output decisions. Bi-LSTM follows the same working procedure as LSTM, with one difference: it processes the input using two hidden layers instead of the single hidden layer of LSTM. The first layer works on the preceding information, whereas the second layer works on the input data in the reverse direction. The outputs of the first and second layers are then concatenated. This gives the model access to more contextual data during training, which results in better performance.
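The bidirectional idea, running the sequence once left-to-right and once right-to-left and concatenating the two results, can be sketched independently of the cell type. In the example below, `rnn_step` is a toy recurrent update standing in for an LSTM cell, and both directions share it for brevity; a real Bi-LSTM uses two LSTM layers with separate weights. All names are illustrative assumptions.

```python
import math


def rnn_step(x, h):
    """Toy recurrent update (a stand-in for an LSTM cell in this sketch)."""
    return [math.tanh(0.5 * hi + xi) for hi, xi in zip(h, x)]


def run(step, inputs, h0):
    """Run `step` over the inputs in order and return the final state."""
    h = h0
    for x in inputs:
        h = step(x, h)
    return h


def bidirectional(step, inputs, h0):
    """Concatenate the final states of a forward and a backward pass."""
    forward = run(step, inputs, h0)
    backward = run(step, list(reversed(inputs)), h0)
    return forward + backward


sequence = [[1.0], [0.0], [-1.0]]
state = bidirectional(rnn_step, sequence, [0.0])
print(len(state))  # 2
```

The concatenated state is twice the size of a single direction, which is why a Bi-LSTM with 64 units per direction feeds 128 values to the next layer.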
Gated recurrent unit
Gated Recurrent Unit (GRU) [7] is another widely used variant of the RNN. It builds on the same basic RNN architecture and also addresses the vanishing gradient problem. Its structure differs from LSTM in that, instead of the three gates of LSTM, GRU has two gates, termed the reset gate and the update gate. The update gate decides which, and how much, of the previous information is to be combined with the new information, and thus how much of the preceding information is passed along to the next step. The reset gate determines how much of the past information to drop when generating the new candidate value, which is then combined with the previous state to produce the output.
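For comparison with the LSTM, a single GRU time step can be sketched as follows. This follows one common convention of the standard GRU equations (some references swap the roles of z and 1 - z in the final interpolation); the parameter layout and names are assumptions made for the example, not details taken from the study.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def gru_step(x, h_prev, params):
    """One GRU time step; params[name] = (W, b) acting on [state, x]."""
    hx = h_prev + x  # list concatenation
    z = [sigmoid(a) for a in affine(*params["update"], hx)]  # update gate
    r = [sigmoid(a) for a in affine(*params["reset"], hx)]   # reset gate
    # The reset gate scales the old state before the candidate is computed.
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x
    cand = [math.tanh(a) for a in affine(*params["candidate"], gated)]
    # The update gate interpolates between the old state and the candidate.
    return [(1 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, cand)]


# Hidden size 2, input size 1, all-zero weights: z = 0.5 and the candidate
# is 0, so the state is halved.
zeros = ([[0.0] * 3, [0.0] * 3], [0.0, 0.0])
params = {name: zeros for name in ("update", "reset", "candidate")}
print(gru_step([1.0], [1.0, -2.0], params))  # [0.5, -1.0]
```

Unlike the LSTM, there is no separate cell state: the hidden state itself carries the memory.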
Experimental setup
Dataset
The experiments were performed on three different bilingual (English and Roman-Urdu) datasets belonging to three different genres (1) Twitter-based BT-AP-19 Corpus, (2) Facebook-based RUEN-AP-17 Corpus, and (3) SMS-based SMS-AP-18 Corpus. All the experiments were evaluated for age and gender identification tasks.
BT-AP-19 Corpus
BT-AP-19 corpus is the collection of tweets extracted from 339 profiles. In the BT-AP-19 Corpus, there are 134 male profiles and 205 female profiles for gender identification task, whereas, for age identification task, profiles are divided into four age groups: 18–24 (139 profiles), 25–34 (88 profiles), 35–49 (80 profiles) and 50-xx (32 profiles).
RUEN-AP-17 Corpus
The RUEN-AP-17 corpus [11] is a collection of statuses and comments posted on Facebook. The users voluntarily provided their personal information and their Facebook statuses and comments. Each user contributed 500 posts/comments with a minimum of five words in each post. The profiles were collected from 479 Facebook users, consisting of 328 male profiles and 151 female profiles. Age-wise, the users were grouped into three age groups: (1) xx-19 (170 profiles), (2) 20–24 (218 profiles), and (3) 25–xx (91 profiles).
SMS-AP-18 Corpus
SMS-AP-18 corpus [10] is based entirely on SMS messages. The users manually provided these SMS messages along with their demographic information. It consists of 84,694 SMS collected from the 810 profiles which include 200 female profiles and 610 male profiles. This corpus also consisted of three age classes, (1) 15-19 (292 profiles), (2) 20-24 (424 profiles), and (3) 25-xx (94 profiles).
Techniques
To demonstrate how our proposed BT-AP-19 Corpus and the existing RUEN-AP-17 and SMS-AP-18 Corpora can be used for the development and evaluation of bi-lingual author profiling systems, we applied four deep learning methods, namely CNN (see Section 4.2), LSTM (see Section 4.3), Bi-LSTM (see Section 4.4) and GRU (see Section 4.5).
Evaluation methodology
The problem of bi-lingual author profiling is cast as a supervised document classification task. Two versions of classification were used: (1) binary classification, which aims to distinguish male profiles from female ones (gender identification task), and (2) multi-class classification, which aims to discriminate between different age groups (age identification task). For the BT-AP-19 Corpus, the age identification task aimed at discriminating between four age groups, i.e. 18-24, 25-34, 35-49 and 50-xx. For the RUEN-AP-17 Corpus, three age groups were defined: xx-19, 20–24, and 25–xx. Similarly, three age groups were defined for the SMS-AP-18 Corpus: 15-19, 20-24, and 25-xx.
For the CNN model, we tried various combinations of the number of layers, the number of filters and the window size to find the configuration best suited to our experiments. A three-layered CNN with max-pooling was observed to be the best combination. Each hidden layer consisted of 128 filters with a window size of 5 and ReLU as the activation function. A single fully connected Dense layer was applied, followed by a Softmax layer for classification.
LSTM, Bi-LSTM, and GRU adopt the basic structure of the Recurrent Neural Network. We used 64 units and a recurrent dropout of 0.2 to train single-layered LSTM, Bi-LSTM and GRU models.
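To give a feel for the relative size of these architectures, the trainable parameters of the convolutional and recurrent layers can be estimated from the stated settings (three convolutional layers with 128 filters and window size 5; 64 recurrent units). The sketch assumes that each model reads 100-dimensional embeddings directly, uses the textbook parameter formulas, and ignores the embedding, dense and softmax layers; actual frameworks may differ slightly (for example, through extra bias terms in some GRU implementations).

```python
def conv1d_params(window, in_channels, filters):
    """Filter weights plus one bias per filter."""
    return window * in_channels * filters + filters


def recurrent_params(units, input_dim, blocks):
    """Each block has an input matrix, a recurrent matrix and a bias vector.

    An LSTM has 4 blocks (three gates plus the candidate),
    a GRU has 3 (two gates plus the candidate).
    """
    return blocks * (units * (input_dim + units) + units)


EMBED_DIM, UNITS, FILTERS, WINDOW = 100, 64, 128, 5

cnn = conv1d_params(WINDOW, EMBED_DIM, FILTERS) + 2 * conv1d_params(WINDOW, FILTERS, FILTERS)
lstm = recurrent_params(UNITS, EMBED_DIM, blocks=4)
bilstm = 2 * lstm  # one LSTM per direction
gru = recurrent_params(UNITS, EMBED_DIM, blocks=3)
print(cnn, lstm, bilstm, gru)  # 228224 42240 84480 31680
```

Under these assumptions the GRU is the smallest recurrent model and the Bi-LSTM is exactly twice the size of the LSTM.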
Hyper-parameters Settings
Hyper-parameters are the training settings that are fixed manually, with pre-decided values, before the training phase of a deep learning model begins. In our experiments, initial efforts were devoted to exploring parameter combinations. The search was performed manually, trying random combinations to find generally applicable parameters that produced the best accuracy score.
The embedding dimension must be selected carefully: overly large embedding dimensions add parameters without necessarily adding useful information, whereas overly small ones may fail to capture the semantics. The embedding layer in our experiments was seeded with the GloVe pre-trained word embedding weights. We experimented with 50-, 100- and 200-dimensional GloVe pre-trained embeddings. The 100-dimensional embedding was selected, as it produced better results for most models.
The mini-batch (batch) size is constrained by the available computational power. Since all the experiments were performed on a PC with an Intel Core i7 processor @ 3.4 GHz and 16 GB of RAM, the batch size was set to 64 to ensure a smooth training process.
Dropout regularization was introduced to address the issue of over-fitting. The dropout value was set to 0.5. Furthermore, all of the models produced their best accuracy between 20 and 50 epochs, so the number of epochs was set to 50.
Adam was used as the optimizer with the default learning rate = 0.001.
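For reference, a single Adam update for one scalar parameter can be sketched as below. The learning rate of 0.001 is the value stated above; the remaining constants (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the values commonly used as defaults and are assumed here, since they are not reported in this section. The function name adam_step is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (param, m, v).

    m and v are the running first and second moment estimates of the
    gradient, and t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from param = 1.0 with gradient 4.0 and zero-initialised moments.
new_param, m, v = adam_step(1.0, 4.0, 0.0, 0.0, t=1)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by approximately the learning rate regardless of the gradient's magnitude.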
Evaluation Measure
The performance of our experiments is evaluated using two measures: (1) Accuracy and (2) F1-Measure. Accuracy is the most intuitive performance measure and is defined as the ratio of correctly classified instances to the total number of classified instances.
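Written as a formula, this verbal definition of Accuracy corresponds to the following standard expression:

```latex
\[
\text{Accuracy} =
  \frac{\text{number of correctly classified instances}}
       {\text{total number of classified instances}}
\]
```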
Accuracy is generally considered the standard evaluation measure for a balanced dataset. For imbalanced datasets, however, accuracy may be misleading as a performance measure. Therefore, to evaluate the deep learning models effectively on our imbalanced datasets, F1-Measure was also used as an evaluation measure.
F1-Measure is the harmonic mean of two decision measures, precision and recall, as given in Equation 2.
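The standard definitions of precision, recall, and their harmonic mean are given below for a single class, in terms of true positives (TP), false positives (FP), and false negatives (FN); this notation is assumed here. How the per-class scores are averaged for the multi-class age identification task (for example macro or weighted averaging) is not stated in this section.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
\text{F1-Measure} =
  \frac{2 \times \text{Precision} \times \text{Recall}}
       {\text{Precision} + \text{Recall}}
\tag{2}
\]
```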
The performance of the four models used in the study is compared with the Most Common Category (MCC), which we used as our Baseline Accuracy. The MCC of the BT-AP-19 Corpus is 0.601 for the gender trait and 0.410 for age. For the RUAP-AP-17 Corpus it is 0.680 for gender and 0.460 for age, whereas for the SMS-AP-18 Corpus it is 0.753 and 0.523 for gender and age, respectively.
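The MCC baseline corresponds to a classifier that always predicts the most frequent class, so its accuracy equals the share of that class in the data. The minimal sketch below computes this baseline on toy labels (not taken from any of the corpora) and also shows why F1-Measure is a useful complement: the baseline looks reasonable in terms of accuracy but scores zero F1 on the minority class. The function names are hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and its share of all labels."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def f1_for_class(y_true, y_pred, positive):
    """F1 score of one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct positive prediction
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only.
y_true = ["male"] * 6 + ["female"] * 4
label, baseline_accuracy = majority_baseline(y_true)
y_pred = [label] * len(y_true)  # always predict the most common category

f1_male = f1_for_class(y_true, y_pred, "male")
f1_female = f1_for_class(y_true, y_pred, "female")
```

A model is therefore only informative if its accuracy exceeds this baseline, and F1-Measure helps to verify that the improvement is not confined to the majority class.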
Table 3 presents the results obtained using the deep learning models (CNN, LSTM, Bi-LSTM, and GRU) with GloVe pre-trained embeddings. For the gender identification task, Bi-LSTM produced the best results (Accuracy = 0.882, F1-Measure = 0.839). Similarly, for the age identification task, Bi-LSTM again produced the highest results (Accuracy = 0.735, F1-Measure = 0.739). These results demonstrate that the Bi-LSTM deep learning model is the most suitable for gender and age prediction on our proposed corpus. It can also be noted that the results for age identification are much lower than those for gender identification. This highlights the fact that it is easier to discriminate between two classes (gender identification task) than between four classes (age identification task). The results also suggest that the vocabulary used by males, females, and people from different age groups in bi-lingual text is quite different, which enabled the deep learning models to distinguish between the gender and age classes.
Results using deep learning models on BT-AP-19, RUAP-AP-17 and SMS-AP-18 Corpora
The results on the BT-AP-19 Corpus show that, for the gender identification task, the bidirectional nature of the Bi-LSTM helped in better capturing the gender-differentiating features. The results obtained from CNN and GRU (Accuracy = 0.853 for both; F1-Measure = 0.786 and 0.770, respectively) are lower than the Bi-LSTM results. The LSTM model has the same base structure as Bi-LSTM, but its unidirectional nature was the least successful in classifying gender traits, and it obtained the lowest results among all models (Accuracy = 0.839, F1-Measure = 0.760).
The age classification results displayed trends similar to the gender classification results. The best result (Accuracy = 0.735, F1-Measure = 0.739) was obtained by the Bi-LSTM model. Further, the result obtained using the CNN model (Accuracy = 0.706, F1-Measure = 0.708) is better than the GRU model result (Accuracy = 0.676, F1-Measure = 0.667). As with the gender trait, the LSTM model produced the lowest accuracy (Accuracy = 0.645, F1-Measure = 0.699). All the age and gender results on the BT-AP-19 Corpus are better than the Baseline Accuracy.
The results obtained on the RUAP-AP-17 Corpus show that, for the gender identification task, Bi-LSTM produced the best results (Accuracy = 0.833, F1-Measure = 0.889), whereas for age identification CNN produced the best results (Accuracy = 0.667, F1-Measure = 0.581). In terms of accuracy, both the gender and the age results obtained on the RUAP-AP-17 Corpus are lower than the results on the BT-AP-19 Corpus. However, all the accuracy results obtained are better than the Baseline Accuracy.
The results on the SMS-based SMS-AP-18 Corpus show that, for the gender identification task, CNN produced the best results (Accuracy = 0.815, F1-Measure = 0.873), whereas for the age-group identification task GRU and Bi-LSTM produced the best accuracy of 0.667. However, the F1-Measure of GRU (0.581) was better than that of Bi-LSTM (0.500). Again, all the results are better than the Baseline Accuracy.
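The margin by which the best accuracy exceeds the Baseline Accuracy differs considerably across corpora and tasks. The short sketch below recomputes these margins from the figures reported in this section; the variable names are hypothetical and the corpus names follow the spelling used in the results above.

```python
# (corpus, task): (best reported accuracy, MCC baseline accuracy)
RESULTS = {
    ("BT-AP-19", "gender"): (0.882, 0.601),
    ("BT-AP-19", "age"): (0.735, 0.410),
    ("RUAP-AP-17", "gender"): (0.833, 0.680),
    ("RUAP-AP-17", "age"): (0.667, 0.460),
    ("SMS-AP-18", "gender"): (0.815, 0.753),
    ("SMS-AP-18", "age"): (0.667, 0.523),
}

# Absolute accuracy gain over the baseline, rounded to three decimals.
margins = {
    key: round(best - baseline, 3)
    for key, (best, baseline) in RESULTS.items()
}

smallest = min(margins, key=margins.get)
largest = max(margins, key=margins.get)

for key, margin in margins.items():
    print(key, margin)
```

With these figures, the largest gain over the baseline is obtained for age identification on the BT-AP-19 Corpus and the smallest for gender identification on the SMS-AP-18 Corpus, whose baseline is already high.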
This study presents a benchmark corpus of bi-lingual (English and Roman-Urdu) tweets, in which each profile is annotated with author trait information (age, gender, education level, province, language, and political party). We applied four deep learning models (CNN, LSTM, Bi-LSTM, and GRU) to the gender and age classification tasks on three bi-lingual (English and Roman-Urdu) corpora from different genres (Twitter, Facebook, and SMS). The results showed that the Bi-LSTM based approach is the most effective for both the gender identification (Accuracy = 0.882, F1-Measure = 0.839) and the age identification (Accuracy = 0.735, F1-Measure = 0.739) tasks on our proposed Twitter-based BT-AP-19 Corpus. Further, all the results obtained using deep learning models on the bi-lingual corpora were encouraging and surpassed the Baseline Accuracy.
In the future, we plan to apply other methods for author profiling tasks and predict traits other than age and gender.
Footnotes
The BT-AP-19 Corpus is publicly available for research purposes only. It can be obtained through e-mail. Licensed by: The Natural Language Processing (NLP) Group, COMSATS University Islamabad, Lahore Campus. License details are included with the corpus.
