DPre: Effective preprocessing techniques for social media depressive text

Abstract

Depression has become one of the most common public health issues. Many people with depression rely on social media to express their grief. The text generated by these users can be used to support research on detecting early-stage depression and providing support. However, social media text cannot be used directly to build a reliable automatic depression detection system, because it contains a lot of irrelevant, inaccurate, and noisy information. Moreover, the basic preprocessing steps used with most machine learning models have limited functionality and thus lead to considerable information loss. This loss is not affordable, especially in affective computing (mental health) for text. In this paper, we present a set of preprocessing techniques for depressive text, DPre, to obtain readable text from raw and noisy tweets. The method helps minimize the loss of information and expressions hidden in the raw tweet, and the processed, clean text is ready to be fed into any machine learning algorithm. The readability of the processed text is evaluated and compared with that of raw tweets using four readability scores: the Flesch Reading Ease score, the Flesch-Kincaid score, the Coleman-Liau index, and the Dale-Chall score. Compared to basic state-of-the-art preprocessing methods, the proposed method significantly improved the readability scores.

Keywords

Depression, social media text, pre-processing, readability score, tweets

1. Introduction

The problem of depression is fast becoming a major public health issue in the 21st century. There may be many reasons behind this mental health issue, viz. lifestyle, stress, past history, etc. Even though many people suffer from this disease, it still has a social stigma attached to it. Early detection of depressive moods can increase the chances of affected individuals obtaining psychological help and overcoming the condition. However, many people with depression are unaware of their condition due to a lack of mental health awareness, a lack of frequent treatment, and the fact that mental health disorders, unlike physical ailments, do not cause discomfort or pain, so patients may never recognize the illness. According to Loughlin et al. [1], even people who are aware of depression are typically hesitant to seek professional treatment due to a sense of stigma or embarrassment. The authors of [2] noted that the standard clinical diagnosis of depression is based mostly on standardized test scores, which are highly reliable but have drawbacks in sensing and recognizing logicality and proficiency. Clinicians not only use standard assessment scales such as PHQ-9 or a self-rating depression scale, but also conduct one-to-one interview sessions. The patient has to go through all these phases over a long period for monitoring purposes, which incurs high costs [3], and following up with patients becomes tedious. In [35], the authors discuss various approaches to and challenges of depression detection.

Textual data analysis may provide the foundation for research in a wide range of domains, from science and engineering to health, decision-making, management, and process control. Mood disorders, particularly depression, have become one of the leading causes of suicide attempts. With the rise of social media, things are moving fast. Many approaches integrating machine learning algorithms and text mining techniques have been developed in recent years to detect early signs of depression. Therefore, it is essential that the information given to learning algorithms is prepared so that machines can process a large amount of correct information. Higher readability has been linked to improved knowledge retention and understanding [13].

Over the last 10 years, the exponential expansion in practical applications of machine learning has been accompanied by several significant advancements in the underlying algorithms and methodologies. There has also been a significant increase in the use of social media platforms such as Facebook and Twitter in recent years. These platforms provide enormous amounts of data [4, 5], which can be collected in massive quantities and used to train machine learning and deep learning algorithms that support decision-making for various purposes. Clean text data has a significant impact on an algorithm’s performance: the algorithm may behave unpredictably when the data is inconsistent or noisy. Thus, if the data is not preprocessed carefully, performance may suffer and the output will not be as expected. Improving the readability and understandability of the data given to computational models is therefore the highest priority. The vocabulary used by people on social media is very casual and friendly. Identifying the real meaning is difficult due to the sporadic use of shortened words (e.g., w8, gud, F9), the short length of posts, and the slang content of social media text. Social media textual data gives rise to a variety of applications, all of which require a large amount of processing [6].

Moreover, the basic preprocessing steps used with most machine learning models are not sufficient, as they lead to considerable information loss. This loss is not affordable, especially in the domain of emotion and mental state recognition. Hence, in this paper, we propose a set of depressive text preprocessing techniques, DPre (hypertext removal, special character removal, slang word handling, contraction handling, segmentation of complex words, replacement of elongated words, spelling check, language translation, and negation handling), to make raw and noisy tweets clean and ready to be fed to any machine learning algorithm. DPre improves the text’s readability and minimizes the loss of information and expressions hidden in the raw tweet. The readability of the processed text is compared with that of raw tweets using four readability scores: the Flesch Reading Ease score, the Flesch-Kincaid score, the Coleman-Liau index, and the Dale-Chall score.

The rest of the paper is organized as follows. Section 2 presents the related work, and Section 3 describes the standardization techniques used in preprocessing the text. Section 4 discusses the performance measures and presents the comparative results. Finally, Section 5 presents the conclusions and the scope for future work.

2. Related work

User-generated text, which is both noisy and sparse, dominates social media platforms. As a result, data gathered from social media must be preprocessed to remove noise; otherwise, information may be lost [7]. Several lines of health-domain research [8, 9] have been emerging, bringing creative and distinct possibilities as well as various concerns for researchers. In particular, user-generated content on social media and its network structures are having a huge impact on the regulation of health services. Therefore, it is important to understand social media content accurately.

In recent years, several researchers proposed various approaches to integrate machine learning algorithms, NLP techniques as well as text mining techniques to detect the emotional state of the user at an early stage and prevent suicidal attempts using social media [10, 11, 12]. These approaches have proven to be successful and affordable when compared to traditional ways, as well as to decrease restrictions and help in clinical assessment in a more dynamic manner. Simultaneously, individuals are used to sharing their inner feelings or thoughts on social media. The vast corpus has a variety of information expressing emotions such as grief, frustration, and breakdowns, all of which might indicate depression.

Singh et al. [13] experimented with various preprocessing approaches on Twitter data and showed an improvement in model accuracy. They handled slang words by using an n-gram language model and substituting them with the correct word. They also performed common preprocessing steps such as the removal of special characters, punctuation, and URLs, word segmentation, and spell checking. Users often express their opinions or emotions about an entity, product, or topic by using emojis or emoticons, which have become increasingly popular on social media, eCommerce sites, and blogs. Fernández-Gavilanes et al. [14] built a new emoji sentiment lexicon based on the emoji definitions provided by Emojipedia, as well as different versions of the lexicon. Ghag et al. [15] experimented with a movie dataset and evaluated the effect of removing commonly used words known as stopwords (such as “the”, “an”, “a”, and “of”) on various sentiment identification models. The authors found that deleting stopwords improves classification accuracy for the classic sentiment classifier. Schofield et al. [16] explored the impact of preprocessing on sentiment classifiers. The authors experimented with common preprocessing techniques such as stemming and stopword removal and found that these techniques are ineffective or have very little effect. They suggested that standardized processing techniques alone do not help in understanding the text; rather, preprocessing techniques must be defined according to the requirements of the domain.

Emojis are frequently treated as noise in semantic classification tasks, and they are often removed from the dataset during the pre-processing step [17]. Emojis, on the other hand, include semantic information due to their widespread use and variation. The research that has been done on using emojis for emotion analysis and sarcasm detection shows that using the semantic information they provide is advantageous [18, 19].

In another study, Phan et al. [20] investigated the use of tweets to detect drug misuse in real time. The authors used a dataset of registered and unregistered medications, as well as original content from 31,478 tweets. They trained Naïve Bayes, Random Forest, and Support Vector Machine (SVM) classifiers without applying any preprocessing. When the constructed classifier was tested on unseen data, the final model achieved 74% precision. Later, TF-IDF (Term Frequency-Inverse Document Frequency) was used to represent the significance of a word in a specific post and to enhance accuracy. In this study, Mechanical Turk was used to collect large volumes of data. Several studies [21, 22, 23] on emotion categories (such as happy, sad, anger, joy, etc.) identified emotions expressed in text using only basic preprocessing steps and hence reported lower accuracy and precision. In [24], the VAD (valence, arousal, and dominance) values for each sentence are learned by training on the EmoBank dataset, which is already annotated with VAD values. The authors applied basic preprocessing to the Twitter dataset and used the trained model to generate VAD scores for each sentence. The results ($R^2$ score) are poor for almost every regression model apart from Random Forest. These findings suggest that, because only basic preprocessing was applied, the data was not properly cleaned, so the correct word embeddings were not learned, resulting in poor performance of the regression models.

From all the approaches discussed above, it is observed that before textual data is fed to any machine learning model or word embedding technique, the text should be as clean and understandable as possible, especially when working in the domain of emotion recognition or mental health. For instance, word embedding techniques treat the words “please”, “plaese”, “pls”, “pleeeeeaase”, and “plssssss” differently: they either ignore the unmatched words or generate a different vector for each of them. However, the tweeter wants to convey the same meaning with any of these words. Also, when someone elongates a word or uses a native-language word, they are emphasizing it, and hence a more intense sentiment or emotion is hidden in it. The following sections discuss various techniques to overcome these limitations and handle these observations.

3. Proposed framework: DPre

Language plays a key role in understanding the motive behind a text. Therefore, it is important to understand linguistic signals when constructing language technologies in various fields. Handling text is an important task; if it is not done correctly, information may be lost. In the domain of affective computing for text, this loss of information is not affordable. Hence, to preserve all the expressions hidden in the text, a number of preprocessing steps are proposed in this paper to clean the raw text properly. These steps make the raw tweet more readable and ready for any machine learning algorithm. Moreover, they preserve most of the information and expressions that are important in the domain of emotion recognition and mental health.

As illustrated in Fig. 1, DPre processes the tweets through a sequence of steps to remove all kinds of noise and make them clean. In this procedure, it uses a number of corpora/dictionaries/datasets such as NLTK word corpus [25], WordNet corpus [26], Slang word dictionary, Contraction word dictionary, Emojis/Emoticons dictionary, and Sentiment140 dataset [27]. Slang words, Contraction words, and Emojis/Emoticons dictionaries are created by collecting data from different online resources. The purpose of the research is to clean the tweets. Hence, we selected a dataset of 1.6 million tweets (Sentiment140 dataset) available on Kaggle. These tweets are annotated with 0 and 1 as negative and positive sentiment. We considered only tweets and ignored classes. However, the information about classes can be used to identify mood disorder patterns using psychological theories and machine learning techniques.

Figure 1.

DPre: Pre-processing techniques for depressive text.

Using these corpora and dictionaries, DPre performs a number of steps such as hypertext removal, special characters removal, slang words handling, contraction handling, segmentation of complex words, replacement of elongated word, spelling check, language translation, and negation handling (as shown in Algorithm 1).

Algorithm 1: DPre(text)
    t1 = remove_hyperlinks(text)
    t2 = remove_sp_char(t1)
    t3 = change_slang(t2)
    t4 = change_contractions(t3)
    t5 = emojis(t4)
    t6 = word_segment(t5)
    t7 = change_elong_word(t6)
    t8 = check_spell(t7)
    t9 = lang_trans(t8)
    t10 = handle_negation(t9)
    return t10
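The chaining in Algorithm 1 can be sketched in Python as a left-to-right composition of steps. This is only an illustrative sketch: `run_pipeline`, `strip_at_signs`, and `collapse_spaces` are hypothetical names, and the two stand-in steps are not the DPre steps themselves.

```python
from functools import reduce
from typing import Callable, Sequence

Step = Callable[[str], str]


def run_pipeline(text: str, steps: Sequence[Step]) -> str:
    """Feed the output of each step into the next one, as in Algorithm 1."""
    return reduce(lambda current, step: step(current), steps, text)


# Hypothetical stand-in steps, used only to show the chaining.
def strip_at_signs(text: str) -> str:
    return text.replace("@", "")


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


cleaned = run_pipeline("@jen   hello  there", [strip_at_signs, collapse_spaces])
```

Keeping the steps in a list makes it easy to evaluate subsets of the pipeline by passing a shorter list.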

Removing hyperlinks: Hyperlinks do not contain any affect (emotion)-related information. Hence, it is safe to remove them from the text so as to avoid any misleading information. For this, the text is checked against three regular expressions to detect any kind of URL. We identify all hyperlinks that either start with http or www, or end with .com. Moreover, a URL is detected even if it spans multiple lines (shown in Algorithm 2).

Algorithm 2: remove_hyperlinks(text)
    Check(text, 'http[s]?://\S+', "", Multiline = True)
    Check(text, 'www.\S+', "", Multiline = True)
    Check(text, '\S+.com[/\S+]*', "", Multiline = True)
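A minimal Python sketch of this step, assuming the standard `re` module, could look as follows. The patterns are slightly tightened variants of those in Algorithm 2 (the dots are escaped and a word boundary is added after `com`), and the final whitespace normalization is an added assumption, not part of the algorithm.

```python
import re

# Tightened variants of the three patterns in Algorithm 2 (an assumption of
# this sketch): literal dots are escaped, and ".com" must end at a word boundary.
URL_PATTERNS = [
    re.compile(r"https?://\S+"),
    re.compile(r"www\.\S+"),
    re.compile(r"\S+\.com\b(?:/\S*)?"),
]


def remove_hyperlinks(text: str) -> str:
    """Delete anything matching one of the URL patterns, then tidy whitespace."""
    for pattern in URL_PATTERNS:
        text = pattern.sub("", text)
    # Collapsing whitespace (including newlines) is an extra step added here.
    return " ".join(text.split())
```

Because `\S+` never matches whitespace, each pattern consumes a URL only up to the next space or line break.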

Removing special characters: The characters other than alphabets, digits, full-stop (.), and comma (,) do not contribute much in text analysis. Full-stop and commas are required for sentence-level tokenization. Hence, as illustrated in Algorithm 3, all special characters except full-stop and comma are removed. All the unwanted characters in the list are stored as bad_chars and removed subsequently.

Algorithm 3: remove_sp_char(text)
    Create list bad_chars = [';', ':', '!', '*', '#', '@', '$', '%', '^', '&', '<', '>', '?', '\', '|', '/', '+', '=', '-', '_', '{', '}', '[', ']', '"']
    Loop i in bad_chars:
        text = Replace_text(i, "")
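In Python, the loop of Algorithm 3 can be expressed with a single translation table. The sketch below assumes the same `bad_chars` list; using `str.translate` instead of repeated replacements is a design choice of this example, not something stated in the paper.

```python
# The characters listed in Algorithm 3; full stops and commas are kept.
BAD_CHARS = ';:!*#@$%^&<>?\\|/+=-_{}[]"'

# Map every unwanted character to None so that translate() deletes it.
_DELETE_TABLE = str.maketrans("", "", BAD_CHARS)


def remove_sp_char(text: str) -> str:
    """Remove all characters in BAD_CHARS in one pass over the text."""
    return text.translate(_DELETE_TABLE)
```

A single pass avoids scanning the tweet once per unwanted character.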

Replacement of slang words, contracted words, and emojis/emoticons: Slang words, such as thx for thank you, are replaced with the help of a dictionary compiled from different online resources and our own observations. The challenge is to identify whether a given word is slang or a correct word. For instance, the word “am” is normally used with “I”. However, tweeters also use it for “morning”. These kinds of ambiguities are handled after careful scrutiny of all the tweets in the Sentiment140 dataset and several tweets on Twitter. Subsequently, contracted words (e.g., I’m $\rightarrow$ I am) and emojis/emoticons (e.g., :) $\rightarrow$ happy) are replaced. Emojis/emoticons carry significant affect-related information; hence, it is important to handle them.
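The dictionary lookups described above can be sketched as follows. The three dictionaries are tiny hypothetical samples (the paper's dictionaries are compiled from online resources and are not reproduced here), and matching on whitespace-separated tokens is a simplifying assumption, so a token with attached punctuation such as "thx," would not be matched.

```python
# Hypothetical sample entries; the real dictionaries are much larger.
SLANG = {"thx": "thank you", "abt": "about", "gud": "good", "w8": "wait"}
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
EMOTICONS = {":)": "happy", ":(": "sad"}


def replace_tokens(text: str, mapping: dict) -> str:
    """Replace each whitespace-separated token found in the mapping."""
    return " ".join(mapping.get(tok.lower(), tok) for tok in text.split())


def normalize(text: str) -> str:
    """Apply slang, contraction, and emoticon replacement in that order."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes as straight ones
    for mapping in (SLANG, CONTRACTIONS, EMOTICONS):
        text = replace_tokens(text, mapping)
    return text
```

Ambiguous tokens such as "am" would need context-dependent rules that a plain lookup like this cannot express.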

Word segmentation: Another challenge in preprocessing is multiword expressions. The authors in [28] explored various analytical strategies for integrating multiword phrases used in a real electorate context. The process of adding the word boundary characters in a word containing multiple words (eg. wordsegmentation $\rightarrow$ word segmentation) is called word segmentation.

In English, spaces and punctuation marks are treated as word boundary characters. However, tweeters do not follow formal English guidelines, and multiple words are sometimes joined together, for example in a hashtag. For instance, the word “thedecisionwasfair” can be processed word by word as “the decision was fair”. Generally, words associated with # and @ are multiword tokens. These words cannot be ignored, as they may carry affect (emotion)-related information. Therefore, word segmentation is an important preprocessing step in high-level NLP tasks. It has been implemented using the dynamic-programming-based Python API wordsegment. To reduce the time complexity, its segment() function is applied only if the word is not a valid word according to the WordNet corpus and consists of alphabets only.

For instance, let the input word be “@comeagainjen”.

After applying word segmentation, it becomes “come again jen”.
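To illustrate the dynamic programming idea behind word segmentation, the sketch below splits a token into the fewest words from a small vocabulary. It is not the wordsegment library: the vocabulary `VOCAB` is a hypothetical toy set, and the fewest-words criterion is a simplifying assumption made for this example.

```python
# Hypothetical toy vocabulary standing in for a real word list.
VOCAB = {"the", "decision", "was", "fair", "come", "again", "jen", "a", "in"}


def segment(word: str, vocab=VOCAB) -> list:
    """Split `word` into the fewest vocabulary words; return [word] if impossible."""
    word = word.lower()
    n = len(word)
    # best[i] is the shortest list of words covering word[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [word]
```

As in the text, such a function would only be called on alphabetic tokens that are not already valid words.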

Handling elongated words: An elongated word is one in which some characters are repeated several times to emphasize the word. Elongation is considered only if a character is repeated more than 2 times. These words are critical, as they display more affect (emotion). Some examples of elongated words are pleeeeeeease, gooooood, etc. Algorithm 4 describes the proposed change_elong_word() function. This function is applied only if the word is not a valid word according to the WordNet corpus and consists of alphabets only. In this algorithm, the characters that are repeated more than 2 times are identified first. Then all possible combinations of these characters are obtained (as shown in the example below) by repeating the identified characters 0, 1, or 2 times to generate different possible words. Out of all the generated words, only the valid word that is present in WordNet is considered for further processing.

Algorithm 4: change_elong_word(text)
    For each word in text:
        If word not in WordNet:
            If word contains only alphabets:
                Count the consecutive occurrence frequency of each character; append the character to list S and its frequency to list S_count
                k = number of characters having frequency higher than 2
                If k == 0 or len(S) < 2:
                    continue
                Else:
                    m = pow(2, k)
                    Loop while m > 0:
                        nw = ''
                        b = binary(m - 1)
                        following the binary sequence b, generate word nw
                        m = m - 1
                    Out of the generated words, keep only WordNet-valid words

For instance:

text = [“good mornnnninggggg everyone"]

S = [’m’, ’o’, ’r’, ’n’, ’i’, ’n’, ’g’]

S_count =[1, 1, 1, 4, 1, 1, 5]

k = 2 #as n and g are repeated more than 2 times

m = 2^k = 4 #possible combinations

b=binary(m-1) =binary (3) = 11 # binary equivalent

So, 4 words are generated as

[1,1] ->[‘m’, ‘o’, ‘r’, ‘n’,‘n’, ‘i’, ‘n’, ‘g’,‘g’]

[1,0] ->[‘m’, ‘o’, ‘r’, ‘n’,‘n’, ‘i’, ‘n’, ‘g’]

[0,1] ->[‘m’, ‘o’, ‘r’, ‘n’,‘i’, ‘n’, ‘g’,‘g’]

[0,0] ->[‘m’, ‘o’, ‘r’, ‘n’, ‘i’, ‘n’, ‘g’]

Final combinations = [“mornningg”, “mornning”, “morningg”, “morning”]

After checking in WordNet, output obtained is “good morning everyone”
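The worked example above can be reproduced with a short Python sketch. Two assumptions are made: a small hypothetical vocabulary `VOCAB` stands in for WordNet, and, following the example, each elongated run is reduced to either one or two copies of the character.

```python
from itertools import groupby, product

# Hypothetical stand-in for the WordNet validity check.
VOCAB = {"good", "morning", "everyone", "please", "miss"}


def candidates(word: str) -> list:
    """Generate every word obtained by shrinking each long run to 1 or 2 copies."""
    runs = [(ch, len(list(grp))) for ch, grp in groupby(word)]
    # Runs longer than 2 get two options; shorter runs are kept unchanged.
    options = [(ch, ch * 2) if count > 2 else (ch * count,) for ch, count in runs]
    return ["".join(parts) for parts in product(*options)]


def change_elong_word(text: str, vocab=VOCAB) -> str:
    """Replace alphabetic out-of-vocabulary words by a valid shortened form."""
    fixed = []
    for word in text.split():
        if word.isalpha() and word.lower() not in vocab:
            valid = [c for c in candidates(word.lower()) if c in vocab]
            word = valid[0] if valid else word
        fixed.append(word)
    return " ".join(fixed)
```

With k elongated runs, `candidates` returns 2^k strings, matching the count m in Algorithm 4.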

Checking spelling: Many misspelled words can be found in written communication (e.g., darlin $\rightarrow$ “darling”, satify $\rightarrow$ “satisfy”). Bertoldi et al. [29] investigated the impact of misspelled words on the performance of machine learning models. The authors found that performance is connected to the noise level present in the text and that noisy data affects the performance of any learning algorithm. A spell checker is used to remove this noise from the data. Again, to keep the execution time low, we apply the spell checker only to words that cannot be matched exactly in WordNet and are made up of alphabets only; the misspelled word is then replaced with the correct word. This is implemented using the Python API pyspellchecker, as illustrated in Algorithm 5.

Algorithm 5: check_spell(text)
    For each word in text:
        If word not in WordNet:
            If word contains only alphabets:
                newword = Spell_checker(word)
                text = replace(word, newword)
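The control flow of Algorithm 5 can be sketched with the standard library alone. Note the substitution: instead of pyspellchecker, this example uses `difflib.get_close_matches` against a small hypothetical vocabulary, so it only illustrates the guard conditions and the replacement step, not the corrections the actual spell checker would produce.

```python
from difflib import get_close_matches

# Hypothetical vocabulary standing in for WordNet and the spell checker's lexicon.
VOCAB = ["darling", "satisfy", "depression", "about", "speak"]


def check_spell(text: str, vocab=VOCAB) -> str:
    """Replace alphabetic out-of-vocabulary words by their closest vocabulary word."""
    fixed = []
    for word in text.split():
        if word.isalpha() and word.lower() not in vocab:
            # Keep the word unchanged when nothing is similar enough.
            matches = get_close_matches(word.lower(), vocab, n=1, cutoff=0.8)
            word = matches[0] if matches else word
        fixed.append(word)
    return " ".join(fixed)
```

The cutoff of 0.8 is an arbitrary choice for this sketch; a lower value corrects more words but also risks replacing valid ones.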

Language translation: On social media, users prefer to use their native language to express intense feelings or sentiments. Therefore, native-language text is more intense, expressive, and informative. However, to process the text automatically and effectively, the entire dataset should be written in the same language. Hence, a language translator is used to capture the intense expressions hidden in the native language and convert them into valid English words. Only words that are not nouns are processed for translation, as nouns generally do not hold any affect (emotion). This is implemented using the Python API of Google Translate. Algorithm 6 demonstrates the language translation module, where the language_translate() function returns an object containing translation information.

Algorithm 6: lang_trans(text)
    For each word in text:
        If word not in WordNet:
            If word contains only alphabets:
                If POS_TAG(word) is not NOUN:
                    newword = language_translate(word)
                    text = replace(word, newword)

Handling negation: A language contains some complicated constructs, such as negative phrases and semantically unclear terms, which can only be understood correctly by taking into account their meaning or their nearby words. For this reason, a negation handling technique is proposed that uses a part-of-speech (POS) tagger to associate a POS tag with each word. Here, the main focus is on occurrences of the word “not”. In most cases, “not” is associated with an adjective or adverb on either side of it. It may be placed immediately before the adjective (JJ) or adverb (RB), or sometimes 1–2 tokens before or after the associated word. Hence, it is important to check all the words following “not” till the end of the sentence. If any subsequent word is an adjective (JJ), adverb (RB), or verb (VB), its antonym is obtained from WordNet. If the word is not found in WordNet, its synonym is identified in the dictionary and the antonym of that synonym is obtained. Once the word is replaced by its antonym, the word “not” is removed from the sentence. If this fails to identify the word with which “not” is associated, the same steps are followed for the words that occur before “not”, back to the beginning of the sentence. If a sentence still contains “not” after these steps, it is removed from the dataset so that the corpus has a standardized form, i.e., it keeps only sentences that are either positive or negative. These steps are followed for all sentences that contain the word “not”. Algorithm 7 describes the proposed algorithm for handling negation.

Algorithm 7: handle_negation(text)
    If a sentence contains the word "not":
        Flag = 0
        For each word from the occurrence of "not" till the end of the sentence:
            If subsequent word = Adjective (JJ) or Adverb (RB) or Verb (VB):
                If antonym(subsequent word) exists:
                    text = replace(["not", subsequent word], antonym(subsequent word))
                    Flag = 1
                Else:
                    n_w = obtain antonym(synonym(subsequent word))
                    text = replace(["not", subsequent word], n_w)
                    Flag = 1
        If Flag == 0:
            For each word from the occurrence of "not" till the start of the sentence:
                If word = Adjective (JJ) or Adverb (RB) or Verb (VB):
                    If antonym(word) exists:
                        text = replace([word, "not"], antonym(word))
                        Flag = 1
                    Else:
                        n_w = obtain antonym(synonym(word))
                        text = replace([word, "not"], n_w)
                        Flag = 1
        If Flag == 0:
            Remove the sentence

For instance, consider the following sentences:

Example 1:

Sentence: “Today my mood is not good”,

Tokenization: [‘Today’, ‘my’, ‘mood’, ‘is’, ‘not’, ‘good’]

POS Tagging: [(Today, NN), (my, PRP), (mood, NN), (is, VB), (not, RB), (good, JJ)]

Word after ‘not’: [good, JJ]

Antonym using WordNet: [‘evil’]

New Sentence: [Today my mood is evil]

Example 2:

Sentence “John likes the blue house not the at the end of the street”

Tokenization [‘John’, ‘likes’, ‘the’, ‘blue’, ‘house’, ‘not’, ‘the’,’ at’, ‘the’,’ end’, ‘of’, ‘the’,’ street’]

POS Tagging [(John, NNP), (Likes, VBZ), (the, DT), (blue, JJ), (House, NN), (not, RB), (at, IN), (the, DT), (end, NN), (of, IN), (the, DT), (street, NN)]

Word before ‘not’ is: [ Likes, VBZ]

Antonym using WordNet is [‘dislike’]

New Sentence [John dislike the blue house at the end of the street]

Example 3:

Sentence “I am not going to forgive him”

Tokenization [‘I’, ‘am’, ‘not’, ‘going’, ‘to’, ‘forgive’, ‘him’]

POS Tagging [(I, PRP), (am, VBP), (not, RB), (going, VB), (to, TO), (forgive, VB), (him, PRP)]

Word after ‘not’ and ‘going to’ is: [‘forgive’, ‘VB’]

Antonym using WordNet is [‘blame’]

New Sentence [I am going to blame him]
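The three examples above can be traced with a simplified Python sketch of the forward-then-backward search. Several assumptions are made: the POS tags and antonyms come from small hypothetical dictionaries (`POS`, `ANTONYMS`) filled with the values shown in the examples rather than from a tagger and WordNet, the synonym fallback is omitted, and candidate words without a known antonym are skipped.

```python
# Hypothetical lookup tables holding only the values used in the examples.
ANTONYMS = {"good": "evil", "forgive": "blame", "likes": "dislike"}
POS = {"good": "JJ", "blue": "JJ", "going": "VB", "forgive": "VB",
       "likes": "VBZ", "is": "VB", "am": "VBP"}
TARGET_TAGS = ("JJ", "RB", "VB")


def is_candidate(word: str) -> bool:
    """True for adjectives, adverbs, or verbs that have a known antonym."""
    tag = POS.get(word.lower(), "NN")  # unknown words default to noun
    return tag.startswith(TARGET_TAGS) and word.lower() in ANTONYMS


def handle_negation(sentence: str):
    """Replace the word negated by 'not' with its antonym; None means 'remove'."""
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    if "not" not in lowered:
        return sentence
    idx = lowered.index("not")
    # Search after "not" first, then backwards towards the sentence start.
    search_order = list(range(idx + 1, len(tokens))) + list(range(idx - 1, -1, -1))
    for i in search_order:
        if is_candidate(tokens[i]):
            tokens[i] = ANTONYMS[lowered[i]]
            del tokens[idx]  # drop "not" once the antonym is in place
            return " ".join(tokens)
    return None
```

Returning `None` mirrors the last step of Algorithm 7, where a sentence with an unresolved "not" is dropped from the corpus.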

Table 1 presents excerpts of the clean text obtained after applying preprocessing step, DPre, as discussed above.

Table 1

Excerpts of twitter dataset after preprocessing

Original text	Preprocessed text
Just had a real good moment, i missssssssss him so much	Just had real good moment, i miss him so much.
@ppinheiro76 umm $\ldots$ not really! i’m just more fond of making out with @isacosta through events and that would just make you all nervous.	Ppinheiro76 umm not really i am just more fond of making out with acosta through events and that would just make you all nervous.
@ricodaniels why dont you pop and see me while your in england.	Rico daniels why do not you pop and see me while your in england.
@comeagainjen http://twitpic.com/2y2lx-http://www.youtube.com/ watch?v=zoGfqvh2ME8	Come again jen.
Depression is something i dothn’t speak abt even going through it because it’s also such a double edged sword.	Depression is something i do not speak about even going through it because it is also such double edged sword.
I think Im ready for a nap, partying it up tonight at J lounge in downtown LA.	I think I am ready for nap, partying it up tonight at lounge in downtown los angeles.
Ill let you know next the time im in town	Is will let you know next the time i am in town.
@SilkCharm re: #nbn as someone already said, does fiber to the home mean we will all at least be regular now	Silkcharm reply nbn as someone already said does fiber to the home mean we will all at least be regular now
@AngMoGirl oh don’t worry, not gonna be too bad cos its a Sunday!	An gmo girl oh do not worry, not going to be too bad because its sunday.
Tryna listen to happy songs instead of sad songs to try and undo my depression oof.	Trying listen happy song instead of sad songs to try undo my depression off.
I had a fun night! got to see @nevershoutnever and andy from @holidayparade! yaye! then @waitrewindthat took me to ihop yum!.	I had fun night got to see never shout never and andy from holiday parade yaye then wait rewind that took me to ihop yum.
@eterna1dreamer She didn’t pick up!!! I’m trying to escalate it to someone above her. Not gonna leave it that easy!.	Eterna1 dreamer she did not pick up i am trying to escalate it to someone above her. Not going to leave it that easy.

Table 2

Evaluation results of readability score for various categories of DPre

Test cases	Preprocessing approaches	Flesch reading ease	Flesch_kincaid score	Coleman_liau score	Dale_Chall score
Batch 1	Category 1	75.26	4.37	7.07	9.47
	Category 2	78.06	4.13	6.34	7.99
	Category 3	78.32	4.10	6.24	7.82
	Category 4	83.61	3.45	5.18	7.89
Batch 2	Category 1	79.20	3.83	7.01	8.12
	Category 2	81.14	3.62	5.58	7.51
	Category 3	81.28	3.60	5.42	7.35
	Category 4	85.33	3.11	4.62	7.55
Batch 3	Category 1	80.80	3.62	5.11	8.75
	Category 2	80.81	3.63	4.61	8.13
	Category 3	81.38	3.55	4.46	7.91
	Category 4	86.50	2.91	3.51	8.03
Batch 4	Category 1	81.66	3.49	5.93	8.94
	Category 2	80.54	3.90	6.81	8.31
	Category 3	81.46	3.77	6.68	8.14
	Category 4	87.23	3.09	5.97	8.36
Batch 5	Category 1	74.63	4.44	6.84	9.68
	Category 2	73.15	4.89	6.52	8.79
	Category 3	73.98	4.77	6.34	8.53
	Category 4	83.15	3.65	4.60	8.57
Batch 6	Category 1	81.04	3.44	5.61	9.40
	Category 2	81.42	3.34	5.37	7.95
	Category 3	81.93	3.28	5.29	7.79
	Category 4	89.12	2.38	3.88	8.16
Batch 7	Category 1	78.14	4.03	6.37	8.76
	Category 2	78.14	4.03	6.37	8.76
	Category 3	80.27	3.67	5.48	7.78
	Category 4	85.48	3.02	4.45	8.02
Batch 8	Category 1	80.38	3.61	5.26	9.00
	Category 2	80.37	4.09	5.58	8.02
	Category 3	80.59	4.08	5.53	7.85
	Category 4	87.63	3.25	4.03	8.10
Batch 9	Category 1	80.41	3.88	6.47	9.02
	Category 2	81.42	3.67	5.52	8.40
	Category 3	81.55	3.66	5.40	8.23
	Category 4	86.73	3.06	4.25	8.34
Batch 10	Category 1	77.09	4.17	6.48	8.94
	Category 2	75.70	4.23	6.25	8.37
	Category 3	76.25	4.15	6.15	8.06
	Category 4	82.89	3.32	4.81	8.19

4. Performance measures

4.1 Assessment of readability parameters

A readability score can be used to assess the understanding of a text document. Flesch-Kincaid, Flesch Reading Ease, SMOG, Fry, Fog, and Dale-Chall are some of the most commonly used equations. Each of these formulae employs a different mathematical equation to calculate the required reading grade level for the reader to comprehend the written material [30]. We used four metrics to assess the document’s readability: Coleman-Liau index, Dale-Chall index, the Flesch Reading ease, and the Flesch Kincaid score. Below is a brief overview of these measurements.

A. Flesch reading ease score – The Flesch metric [31] scores a document between 0 and 100, from difficult to easy to read. The score measures how difficult a document is to read and takes into account the length of words and sentences, as well as the average number of syllables, as shown in Eq. (1). The higher the score, the more readable the document and the lower its difficulty level.

$\displaystyle\text{Flesch Reading Score}=206.83-1.01\left(\frac{\text{total words}}{\text{total sentences}}\right)-84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$ (1)

For instance, if the Flesch Reading Score is 85, it indicates that a person can understand sentences with 85% ease or more.

B. Flesch-Kincaid score – It converts the 0–100 Flesch reading score [32] into a grade level. It can also be read as the number of years of schooling needed to comprehend the material. The grade level is calculated using Eq. (2):

$\displaystyle\text{Flesch-Kincaid Score}=0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)+11.8\left(\frac{\text{total syllables}}{\text{total words}}\right)-15.59$ (2)

For example, a text document with a 2.9 score on this test, would be considered understandable by anyone with a 2nd or 3rd-grade reading level.

C. Coleman-Liau index – It is based on characters rather than syllables. Because syllables are more difficult to identify than letters, this index is somewhat simpler for machines to compute. It considers the number of characters in each word. The Coleman-Liau index, like the Flesch-Kincaid score, determines the grade level of reading comprehension necessary to understand a text [33]. For instance, a score of 6.2 means that the document is understandable at a 6th-grade level. Eq. (3) is used to calculate the score:

Coleman-Liau Index $\displaystyle = 5.89\left(\frac{\textit{characters}}{\textit{words}}\right) - 0.3\left(\frac{\textit{sentences}}{\textit{words}} \times 100\right) - 15.8$ (3)
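A character-based index of this kind needs only letter, word, and sentence counts, as the following sketch shows. It is an illustration under stated assumptions: the function name is hypothetical, and it uses the commonly published form of the index, with coefficients 0.0588 and 0.296 applied to letters and sentences per 100 words, which differs slightly from the rounded coefficients in Eq. (3).

```python
import re

def coleman_liau_index(text: str) -> float:
    """Coleman-Liau index from letters and sentences per 100 words."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one sentence and one word")
    letters = sum(ch.isalpha() for w in words for ch in w)
    letters_per_100_words = 100 * letters / len(words)
    sentences_per_100_words = 100 * len(sentences) / len(words)
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8

if __name__ == "__main__":
    print(coleman_liau_index("I can not sleep at night. Everything feels overwhelming."))
```

Very short, simple inputs can produce negative values, which is one reason to score tweets in larger batches rather than individually.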

D. Dale-Chall readability score – The Dale–Chall approach measures reading difficulty using a list of words familiar to fourth graders, rather than syllables or characters. Words outside this preset list of "common" words are counted as "difficult", and the score combines the percentage of difficult words with the average number of words per sentence. The values correspond to the grade level, as discussed in [34], that a reader is expected to have reached in order to understand the text. Eq. (4) is used as follows:

Dale-Chall Score $\displaystyle = 0.1579\left(\frac{\textit{difficult words}}{\textit{words}} \times 100\right) + 0.0496\left(\frac{\textit{words}}{\textit{sentences}}\right)$ (4)
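The word-list approach of Eq. (4) can be sketched as follows. The names are hypothetical, and `FAMILIAR_WORDS` is a tiny stand-in for the real Dale-Chall list of roughly 3,000 familiar words. The optional `adjust` flag applies the 3.6365 correction that the published formula adds when more than 5% of the words are difficult; Eq. (4) omits it, so it is off by default.

```python
import re

# Hypothetical stand-in for the real Dale-Chall list of familiar words.
FAMILIAR_WORDS = {
    "i", "feel", "so", "sad", "and", "tired", "today", "the",
    "is", "a", "day", "my", "not", "good",
}

def dale_chall_score(text: str, familiar_words: set, adjust: bool = False) -> float:
    """Dale-Chall score from the share of unfamiliar words and sentence length."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one sentence and one word")
    difficult = sum(1 for w in words if w not in familiar_words)
    pct_difficult = 100 * difficult / len(words)
    score = 0.1579 * pct_difficult + 0.0496 * (len(words) / len(sentences))
    # Published adjustment for texts with more than 5% difficult words.
    if adjust and pct_difficult > 5:
        score += 3.6365
    return score

if __name__ == "__main__":
    print(dale_chall_score("I feel melancholy today.", FAMILIAR_WORDS))
```

With a realistic word list, slang and misspelled tokens in raw tweets would all count as difficult, which is why slang replacement and spell checking are expected to lower this score.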

4.2 Results

Results were examined for 500 sentences from the Twitter dataset. Readability tests on social media text provide summary statistics that give insight into the reading level of a document. The results in Table 2 show a significant readability improvement in our dataset after applying DPre.

When data is fed into a learning algorithm, it needs to be presented in an easy-to-understand manner. As social media text is very noisy, preprocessing must be handled first. The readability of the processed text is then checked so that clearer and more informative language vectors can be provided to the learning algorithm. Moreover, readability scores require a minimum number of words to be calculated reliably, so tweets are scored in batches of 50 tweets each. We created four categories for obtaining the results:

Category 1 – considers the raw tweets.

Category 2 – considers the text after applying hyperlink removal, special character removal, slang replacement, contraction replacement, Emojis/Emoticons replacement, and elongated word replacement.

Category 3 – considers the text after applying hyperlink removal, special character removal, slang replacement, contraction replacement, Emojis/Emoticons replacement, word segmentation, and elongated word replacement.

Category 4 – considers the text after applying all proposed preprocessing steps.
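The batching described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the function names are hypothetical, each tweet is treated as one sentence, and `scorer` stands for any readability function that maps a string to a number.

```python
from typing import Callable, List

def batch_tweets(tweets: List[str], batch_size: int = 50) -> List[str]:
    """Join consecutive tweets into one document per batch of `batch_size` tweets."""
    documents = []
    for start in range(0, len(tweets), batch_size):
        chunk = tweets[start:start + batch_size]
        # Terminate each tweet with a period so that sentence counts stay meaningful.
        documents.append(
            " ".join(t.strip().rstrip(".!?") + "." for t in chunk if t.strip())
        )
    return documents

def score_batches(tweets: List[str], scorer: Callable[[str], float],
                  batch_size: int = 50) -> List[float]:
    """Apply a readability scorer to every batch and return one score per batch."""
    return [scorer(doc) for doc in batch_tweets(tweets, batch_size)]

if __name__ == "__main__":
    # Placeholder scorer (word count) standing in for a readability metric.
    sample = [f"tweet number {i}" for i in range(120)]
    print(score_batches(sample, scorer=lambda doc: len(doc.split())))
```

Running the same scorer over the tweets of each category gives one series of batch scores per category, which can then be compared batch by batch.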

Figure 2. (A) Flesch Reading Ease (FRE) readability for each set of sentences. Lower scores indicate less readability. (B) Flesch-Kincaid score. Lower scores indicate high readability. (C) Coleman-Liau score. Lower scores indicate high readability. (D) Dale-Chall (DC) readability for each set. Higher scores indicate less readability.

As shown in Fig. 2A–D, the Flesch reading ease score increases for the processed text, while the other three measures decrease. Because a higher Flesch reading ease score and lower Flesch-Kincaid, Coleman-Liau, and Dale-Chall scores all indicate easier text, the four measures agree that preprocessing improved readability.
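Since the four measures point in different directions, a comparison between raw and processed text has to account for which direction counts as an improvement. A minimal sketch, with hypothetical metric keys and purely illustrative values that are not taken from Table 2:

```python
# Direction of improvement for each metric used in this section.
HIGHER_IS_BETTER = {
    "flesch_reading_ease": True,
    "flesch_kincaid": False,
    "coleman_liau": False,
    "dale_chall": False,
}

def readability_improved(metric: str, raw_score: float, processed_score: float) -> bool:
    """Return True if the processed text is more readable than the raw text."""
    if metric not in HIGHER_IS_BETTER:
        raise KeyError(f"unknown metric: {metric}")
    if HIGHER_IS_BETTER[metric]:
        return processed_score > raw_score
    return processed_score < raw_score

if __name__ == "__main__":
    # Illustrative values only.
    print(readability_improved("flesch_reading_ease", raw_score=40.0, processed_score=65.0))
    print(readability_improved("dale_chall", raw_score=9.0, processed_score=7.5))
```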

As a result, the textual data created by depressive users can be exploited appropriately in the development of systems for detecting depression. However, if such data is used without appropriate preparation, the results may be unreliable. Hence, we conducted a comparative examination of several data preparation approaches on the depression dataset so that they can be used effectively in identifying depressive symptoms.

5. Conclusion and future work

People who suffer from depression often rely on informal or open forums, such as social media, to communicate about their mental health instead of seeking expert help, possibly due to a sense of shame, humiliation, or a lack of understanding of the disease. As a result, the textual data created by these users can be used in the development of early-stage depression detection systems. Using such data without sufficient preparation is risky, as social media text is extremely noisy. Because the platform is open, individuals use informal language such as slang and short abbreviations, which makes it critical to preprocess the text efficiently; otherwise, the information carried by social media text would be lost. Nevertheless, such platforms are quite informative and therefore essential for gaining insight into an individual's mental state or activity. This information may help clinicians and psychologists make better-informed decisions and help people take preventive measures.

In this paper, we have proposed a set of high-level preprocessing techniques, DPre, comprising hyperlink removal, special character removal, slang word handling, contraction handling, segmentation of complex words, replacement of elongated words, spell checking, language translation, and negation handling. We observed that adequate preprocessing plays an important role in cleaning the text and improving the readability of the document. The more understandable the data, the better the computational models are expected to perform, which in turn can better serve patients, their relatives, and medical practitioners.

In the future, we plan to apply these preprocessing techniques to depression datasets with multidimensional feature sets. Other preprocessing approaches will also be investigated in order to achieve high accuracy in predictive models for early-stage depression detection.

References

1. Kristen, Martha, Elizabeth, Stephen. Reviewing the data security and privacy policies of mobile apps for depression. Internet Interv. 2019. doi: 10.1016/j.invent.2018.12.001.

2. Rohizah, Khairuddin, Shahrul AMN, Mohd SNMD. A survey on mental health detection in online social network. Int. J. Adv. Sci. Eng. Inf. Technol. 2018; 4-2. doi: 10.18517/ijaseit.8.4-2.6830.

3. Huijie, Jia, Jiezhong, Yongfeng, Guangyao, Lexing, et al. Detecting stress based on social interactions in social networks. IEEE Trans. Knowl. Data Eng. 2017; 9. doi: 10.1109/TKDE.2017.2686382.

4. Mariam, Arafat, Ghazi. The Effects of Natural Language Processing on Big Data Analysis: Sentiment Analysis Case Study. ACIT. 2019; 1-7. doi: 10.1109/ACIT.2018.8672697.

5. Amandeep, Malka, Beulah. An Analysis of Demographic and Behavior Trends Using Social Media: Facebook Twitter, and Instagram. Social Network Analytics. 2019.

6. Sara, Alan, Preslav, Veselin. SemEval-2014 Task 9: Sentiment Analysis in Twitter. Proceedings of the 8th International Workshop on Semantic Evaluation. 2014. pp. 73-80. doi: 10.3115/v1/s14-2009.

7. Hong, Minsik, Habin, Ruth. Mining service quality feedback from social media: A computational analytics method. Gov. Inf. Q. 2021; 2. doi: 10.1016/j.giq.2021.101571.

8. Zhan. Social media driven public health informatics: Applications in regulatory science. Diss. Abstr. Int. Sect. B Sci. Eng. 2020; 7-B.

9. Hansi, Christopher, Adam, et al. Mining Twitter to assess the determinants of health behavior toward human papillomavirus vaccination in the United States. J. Am. Med. Informatics Assoc. 2020; 2. doi: 10.1093/jamia/ocz191.

10. Social Media Signals for Post-traumatic Stress and Anxiety in Crisis-Inflicted Communities. NIH. 2014.

11. Sho, Yusuke, Fumio, Kosuke, Yuichi, Hiroyuki. Recognizing depression from twitter activity. 2015; 3187-3196. doi: 10.1145/2702123.2702280.

12. Guangyao, Jia, Liqiang, Fuli, Cunjun, Tianrui, Tat-Seng, Wenwu. Depression detection via harvesting social media: A multimodal dictionary learning solution. IJCAI International Joint Conference on Artificial Intelligence. 2017. pp. 3838-3844. doi: 10.24963/ijcai.2017/536.

13. Tajinder, Madhu. Role of Text Pre-processing in Twitter Sentiment Analysis. Procedia Computer Science. 2016. pp. 549-554. doi: 10.1016/j.procs.2016.06.095.

14. Milagros, Jonathan, Silvia, Enrique, Francisco JGC. Creating emoji lexica from unsupervised sentiment analysis of their descriptions. Expert Syst. Appl. 2018; 74-91. doi: 10.1016/j.eswa.2018.02.043.

15. Kranti, Ketan. Comparative analysis of effect of stopwords removal on sentiment classification. 2016; 1-6. doi: 10.1109/IC4.2015.7375527.

16. Alexandra, Mans, Laure, David. Understanding Text Pre-Processing for Latent Dirichlet Allocation. Proc. 15th Conf. Eur. chapter Assoc. Comput. Linguist. 2017.

17. CSPavan, Dhinesh BLD. Novel text preprocessing framework for sentiment analysis. Smart Innovation, Systems and Technologies. 2019; 309-317. doi: 10.1007/978-981-13-1927-3_33.

18. Rafael, Steven, Fred, Marcelo. Assessing sentence similarity through lexical, syntactic and semantic analysis. Comput. Speech Lang. 2016; 39(c): 1-28. doi: 10.1016/j.csl.2016.01.003.

19. Bjarke, Alan, Anders, Iyad, Sune. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and Sarcasm. EMNLP. 2017; 1615-1625. doi: 10.18653/v1/d17-1169.

20. Nhathai, Soon, Manasi, James. Enabling real-Time drug abuse detection in tweets. ICDE. 2017; 1510-1514. doi: 10.1109/ICDE.2017.221.

21. Paul. An Argument for Basic Emotions. Cognition and Emotion. 1992; 169-200. doi: 10.1080/02699939208411068.

22. Xuetong, Martin, Thomas, Suzanne. What about mood swings? Identifying depression on Twitter with temporal measures of emotions. In: WWW ’18 Companion: The 2018 Web Conference Companion. 2018.

23. Peter, Philippe. Emotion dynamics. Current Opinion in Psychology. 2017; 22-26. doi: 10.1016/j.copsyc.2017.06.004.

24. Tara, Shikha. A dimensional representation of depressive text. Lecture Notes on Data Engineering and Communications Technologies. 2021.

25. http://www.nltk.org/book/ch02.html.

26. https://wordnet.princeton.edu/.

27. https://www.kaggle.com/kazanova/sentiment140.

28. Matthieu, Anthony, Patrick. Discriminative strategies to integrate multiword expression recognition and parsing. In: 50th Annual Meeting of the Association for Computational Linguistics; ACL 2012 – Proceedings of the Conference. 2012.

29. Nicola, Mauro, Marcello. Statistical machine translation of texts with misspelled words. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2010. pp. 412-419.

30. Natalie. Readability, suitability, and writing for clients with limited literacy skills. J. Soc. Work. 2019; 5. doi: 10.1177/1468017318767091.

31. Rudolf. How to write plain English. English. 2004.

32. Omeed, Daipayan, Naif, George, Nir, Aria. Readability and quality of wikipedia pages on neurosurgical topics. Clin. Neurol. Neurosurg. 2018; 166: 66-70. doi: 10.1016/j.clineuro.2018.01.021.

33. Pascual, Ángela. Readability indices for the assessment of textbooks: a feasibility study in the context of EFL. Vigo Int. J. Appl. Linguist. 2019; 16. doi: 10.35869/vial.v0i16.92.

34. Jade, Patrice. Flesch and dale-chall readability measures for INEX 2011 question-answering track. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2012; 235-246. doi: 10.1007/978-3-642-35734-3_22.

35. Tara, Shikha. Depression detection: approaches, challenges and future directions. Artificial Intelligence, Machine Learning, and Mental Health in Pandemics: A Computational Approach. 2022; 209-234. doi: 10.1016/B978-0-323-91196-2.00002-8.