Abstract
Emotion classification is a research field that aims to detect the emotions in a text using machine learning methods. In traditional machine learning (TML) methods, feature engineering processes cause the loss of some meaningful information, and classification performance is negatively affected. In addition, the success of modelling using deep learning (DL) approaches depends on the sample size. More samples are needed for Turkish due to the unique characteristics of the language. However, emotion classification data sets in Turkish are quite limited. In this study, the pretrained language model approach was used to create a stronger emotion classification model for Turkish. Well-known pretrained language models were fine-tuned for this purpose. The performances of these fine-tuned models for Turkish emotion classification were comprehensively compared with the performances of TML and DL methods in experimental studies. The proposed approach provides state-of-the-art performance for Turkish emotion classification.
1. Introduction
Emotion classification is a subfield of text mining. It classifies a word, sentence or text according to the emotions contained in the discourse. Generally, the basic emotion models of Ekman [1] and Plutchik [2] are used in emotion classification studies. The Ekman [1] emotion model consists of six basic classes of emotions: ‘anger, sadness, joy, disgust, fear and surprise’. Moreover, the Plutchik [2] emotion model is composed of eight basic emotions: ‘anger, sadness, joy, disgust, fear, surprise, trust and anticipation’.
The feature engineering processes in traditional machine learning (TML) methods cause some important information in the text to be lost, which decreases the classification performance [3]. Some input representations used in emotion classification, such as the bag of words [4], discard the order and integrity of the words in the sentence [5].
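The loss of word order in a bag-of-words representation can be illustrated with a minimal sketch. The two sentences and the helper name `bag_of_words` are hypothetical and are not taken from any of the data sets used in this study.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lower-cased whitespace tokens, discarding their positions."""
    return Counter(sentence.lower().split())


# Two hypothetical sentences with different meanings but identical word counts.
a = "the dog scared the child"
b = "the child scared the dog"

# The two bags are equal, so a classifier fed only these counts
# cannot tell the sentences apart.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)
```

Because both sentences map to the same counts, any information carried by who did what to whom is unavailable to a model that only sees this representation.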
Deep learning (DL) methods take the raw text as input and preserve the context of the words that coexist. In this way, it is possible to classify emotions more successfully than when using TML methods. However, the performance of DL methods depends on the size of the data set used, and the number of data sets for emotion classification is very limited in the literature.
Turkish is a morphologically rich language since it is an agglutinative language and has the ability to produce new words with the help of suffixes. Relatively more samples are needed to create a DL model for Turkish text analysis studies. Thus, it is necessary to use as many written sources as possible to cover the whole vocabulary while training a DL model for Turkish.
In recent years, word embeddings and pretrained language model approaches have changed the direction of studies in natural language processing fields. The training process does not start from scratch thanks to the use of these methods in deep networks. Thus, the generalisation of the model and the learning of the problem are accelerated. In this context, there are sentence level–based methods [6, 7] and word embedding methods [8,9,10] providing prior knowledge at the word level. General purpose pretrained language model approaches [11, 12] were developed as a result of studies aiming to transfer more prior knowledge from word meanings. Pretrained language models aim to learn semantic, syntactic and grammar rules and usage habits along with the word meanings in the text. The requirement of a large-sized emotion data set in DL methods is reduced thanks to the pretrained language models.
In this study, emotion classification was performed on Turkish texts using pretrained language models. The pretrained language models, which are the Bidirectional Encoder Representations from Transformers (BERT) [11] and BERTurk [13] developed for Turkish, were adapted to the emotion classification problem for Turkish. TML, DL and the two pretrained models were applied to the problem. The obtained performances of the applied models were investigated comparatively. It was observed that the pretrained language models are superior to the others.
1.1. Related studies
Emotion classification studies are mainly conducted for the English language [14,15,16,17]. Therefore, most of the available data and tools are for English. There are various studies conducted on languages such as Chinese [18], Arabic [19], French [20] and Romanian [21].
There are studies using pretrained word embeddings [22] and studies [16] using pretrained language models for English. As a result of the emotion analysis performed on an English data set using BERT, the success rate measured by the F1 metric was increased from 58% to 74% [16]. Various studies have been conducted on English emotion analysis using this approach [23,24,25].
The first emotion classification study on Turkish [26] was conducted using TML in 2013. The data set used in this study was derived from the International Survey on Emotion Antecedents and Reactions (ISEAR) [27] data set in English by translating 4265 sample sentences belonging to the four classes of ‘joy, sadness, anger and fear’ into Turkish. Experiments performed using different feature combinations showed that a success rate of up to 81% was achieved in the general accuracy and F1 metrics.
Another study using TML methods [28] was conducted on Turkish Twitter posts. In the study based on Ekman’s emotion model, a data set consisting of 6000 samples was created by taking 1000 samples for each emotion. In experimental studies in which different ML methods with various feature combinations were tested, the support-vector machine (SVM) [29] achieved the highest performance in the term count metric with 70%.
The Turkish emotion data set (TREMO), which is based on Ekman’s model and consists of 27,350 samples, was created as a result of a questionnaire conducted on 4709 people in the study [30, 31]. In the experiments performed on TREMO, the highest performance was achieved with the SVM with an 86% general accuracy. In the continuation study [32], an emotion lexicon was produced using TREMO, and lexicon-based emotion analysis was performed. The lexicon-based approach proposed in the study showed competitive performance against the TML methods with an accuracy of 91%.
In another study [33], emotion classification was conducted on Turkish using DL methods. The Turkish Twitter emotion data set (TURTED) consisting of 195,000 samples was created using emotion keywords according to Ekman’s model. An artificial neural network, a convolutional neural network (CNN) and long short-term memory methods were applied on TURTED and TREMO in the experimental study. The CNN achieved the highest performance on TURTED with 74% accuracy. The methods achieved significantly better performance in the experiments conducted on the TREMO data set compared with the experiments conducted on TURTED. No classification study with pretrained language models on Turkish has been found in the related literature.
2. The proposed approach
In this study, the transfer learning method was applied to BERT and its variations, which are pretrained language models. The models were fine-tuned using the TREMO data set for emotion classification in Turkish. The scheme of the pretrained language model in Figure 1 describes the common structure of the pretrained language models used in the study. The pretrained language model was trained using the training set so that the model could classify emotions. The weights of the pretrained language model were optimised for Turkish emotion classification according to the samples in the training set. The performance of the customised model was measured using the test set.

Figure 1. Proposed fine-tuning method.
Transfer learning adapts the knowledge gained from a source task to the target task. Transfer learning can take many forms. In this study, we use pretrained language models adapted to a downstream task. The adaptation process is performed in three steps. In the first step, a suitable pretrained language model is selected for the task. We selected BERT for this study because it is a successful model for natural language understanding tasks. The second step is the selection of an appropriate layer of the model for adaptation. The last layer was selected as the adaptation layer due to its simplicity and effectiveness. In the third step, the transfer strategy is chosen between fine-tuning (unfrozen parameters) and feature extraction (frozen parameters). Fine-tuning was chosen as the transfer strategy in this study because it is more suitable for classification tasks.
In this study, the prior knowledge in pretrained language models in the BERT architecture was fine-tuned for emotion classification. The [CLS] token represents the entire sentence in the BERT representation, and it is created for classification tasks. The [CLS] token in the last hidden layer of the model was used for fine-tuning in this study. A softmax layer was placed on top of the [CLS] token to determine the probability that the entry belongs to class c (equation (1))
$$p(c \mid h) = \mathrm{softmax}(W h) \tag{1}$$
where W is the weight matrix and h is the final hidden state of the [CLS] token.
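As a rough illustration of the classification head referred to in equation (1), the sketch below computes softmax(W h) for a toy weight matrix W and a toy [CLS] vector. The symbol h for the [CLS] hidden state, the sizes (three classes, hidden size four) and all numeric values are assumptions made for the example; BERT-base uses a hidden size of 768, and the study uses six classes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, h):
    """Return class probabilities softmax(W h) for one [CLS] vector h."""
    logits = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]
    return softmax(logits)


# Toy values (assumptions): one row of W per class, one column per hidden unit.
W = [[0.2, -0.1, 0.0, 0.5],
     [0.1, 0.3, -0.2, 0.0],
     [-0.3, 0.2, 0.4, 0.1]]
h = [1.0, 0.5, -1.0, 2.0]

probs = classify(W, h)
# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

During fine-tuning, both W and the parameters that produce h are updated, which is what distinguishes this strategy from feature extraction with frozen parameters.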
The pretrained language models, the multilingual BERT [11] and BERTurk [13], were adapted to emotion classification in this study. The BERT [11] and DistilBERT [34] language models pretrained in English were used to verify the proposed approach.
The Simple Transformers library, which uses the transformer [35] infrastructure, was used for the fine-tuning operations in the implementation of the proposed approach. The system parameters were set as ‘batch-size:8’, ‘learning-rate:4e-5’ and ‘num-train-epoch:5’. Calculations were conducted on the graphics processing unit (GPU) in the Colab infrastructure. Subsequently, the trained models were tested on the test sets, and the evaluation metrics and results were reported.
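The reported settings could be expressed for the Simple Transformers library roughly as follows. This is a sketch under assumptions, not the authors' script: the mapping of the reported parameters to the library's argument names, the Hugging Face checkpoint identifier used for BERTurk, and the function name `fine_tune` are all choices made for this example.

```python
# Hyperparameters reported in the text, written with the argument names
# that Simple Transformers accepts (this mapping is an assumption).
TRAIN_ARGS = {
    "train_batch_size": 8,
    "learning_rate": 4e-5,
    "num_train_epochs": 5,
}

NUM_LABELS = 6  # Ekman's six emotions, as used in TREMO


def fine_tune(train_df, test_df, model_name="dbmdz/bert-base-turkish-cased"):
    """Fine-tune a BERT-style checkpoint and evaluate it on the test split.

    train_df and test_df are pandas DataFrames with "text" and "labels"
    columns. The default checkpoint name is an assumed identifier for BERTurk.
    """
    # Imported lazily so the configuration above can be inspected without
    # the library (or a GPU) being available.
    from simpletransformers.classification import ClassificationModel

    model = ClassificationModel("bert", model_name,
                                num_labels=NUM_LABELS, args=TRAIN_ARGS)
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(test_df)
    return result
```

Swapping `model_name` for a multilingual or distilled checkpoint (with a matching model type) would cover the other pretrained models compared in the study, again assuming suitable checkpoint identifiers.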
3. Data sets
TREMO [30], the emotion data set in Turkish created and verified manually for Turkish emotion classification, was used. The Twitter Emotion Corpus (TEC) [36] and Blogs [37] data sets used for English emotion classification were also used for the verification. The data sets were split into three parts as ‘train, dev and test’ for the experiments (Table 1).
Table 1. Data sets’ statistics.
The TREMO Turkish emotion data set is labelled with Ekman’s six emotions and has a balanced distribution. The data set was created as a result of a questionnaire conducted on 4709 people. In the questionnaire, the volunteers were asked to briefly write a memory or experience for each emotion of Ekman’s model. The 27,350 collected entries were labelled by at least three different people. After labelling, entries that did not receive a unanimous or majority label were eliminated.
The TEC data set consists of 21,024 Twitter posts labelled with Ekman’s emotion classes for English emotion classification. The TEC data set was created by selecting posts containing hashtags such as #anger, #happy and #fear without expert evaluation. The Blogs data set consists of 4090 sentences selected from 173 blogs. The Blogs data set has been labelled with the classes of Ekman’s emotions and neutral by some experts who assigned only one label to each sentence. A total of 2800 sentences in the neutral class were eliminated for the experiments in this study.
4. Experimental studies
The experiments were conducted using the TML, DL, and DL with word embedding methods and fine-tuned pretrained language models consecutively. Moreover, all the experiments were repeated using both full word forms and lemmas as input in order to measure the effect of lemmatisation in Turkish. Turkish Stemmer was used for the lemmatisation process.
First, emotion classification on Turkish was performed using the Naive Bayes (NB) [38] and SVM [29] TML methods in order to create a baseline and make some comparisons. Since raw text cannot be input in conventional methods, a feature selection process was performed. A total of 4000 features with the highest representation ability according to term frequency–inverse document frequency (TF–IDF) weights were selected among the words and n-grams (2, 3 and 4) in the training set and used as the inputs in the TML experiments.
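The feature selection step can be sketched as follows. The text does not state how the TF–IDF weights were aggregated over the training set, so the sketch assumes that each word or n-gram is ranked by its summed TF–IDF weight; the smoothed IDF formula, the function names and the toy corpus are likewise assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=4):
    """Yield all word n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def select_features(docs, k):
    """Rank n-grams by summed TF-IDF over the corpus and keep the top k."""
    counts = [Counter(ngrams(d.lower().split())) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    score = Counter()
    for c in counts:
        for term, tf in c.items():
            # Smoothed IDF, so terms found in every document keep a small weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            score[term] += tf * idf
    return [term for term, _ in score.most_common(k)]


# Hypothetical toy corpus; the study selected k = 4000 on the TREMO training set.
docs = ["i was very happy today", "i was very scared", "happy birthday to you"]
features = select_features(docs, k=5)
```

Each document would then be represented by the TF–IDF weights of only the selected features before being passed to the NB or SVM classifier.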
In the experiments performed with the gated recurrent unit (GRU) [39], long short-term memory (LSTM) [40] and attention [35] DL algorithms, emotion classifications were conducted using their networks bidirectionally. These DL experiments were repeated after the weights of FastText [10] pretrained word embeddings were appended to the embedding layer of the deep networks. Raw text was used as the input, and the models were trained for 30 epochs in the experiments. Softmax [41] was used as the activation function, Kullback–Leibler divergence [42] was used as the loss function, and Nadam [43] was used as the optimiser.
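To make the loss function concrete, the following sketch evaluates the Kullback–Leibler divergence between a one-hot target and a softmax output. The six-class probabilities are hypothetical values chosen for illustration. For a one-hot target, the divergence reduces to the negative log-probability assigned to the true class, which is the usual cross-entropy loss.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with 0 * log(0) taken as 0."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example with six classes; the true label is class index 2.
target = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
predicted = [0.05, 0.05, 0.70, 0.10, 0.05, 0.05]

loss = kl_divergence(target, predicted)
# For a one-hot target, this equals the cross-entropy -log(q_true).
cross_entropy = -math.log(predicted[2])
```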
Finally, the last experiments were conducted using the pretrained language models that were adapted to emotion analysis. The input texts were converted into BERT representations. In addition, all experiments were also performed on the TEC and Blogs English emotion data sets in order to validate the developed model in terms of the same hyperparameters. The obtained results were compared with the results obtained on TREMO.
5. The performance metrics
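The performance of the proposed approach was measured with the precision (P), recall (R), F1 and accuracy (Acc) (equations (2)–(5))
$$P = \frac{TP}{TP + FP} \tag{2}$$
$$R = \frac{TP}{TP + FN} \tag{3}$$
$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{4}$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
where TP represents the true positives, TN represents the true negatives, FP represents the false positives, and FN represents the false negatives.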
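The four metrics follow directly from the confusion counts. The sketch below uses their standard definitions; the counts are hypothetical and do not correspond to any result reported in this study.

```python
def metrics(tp, tn, fp, fn):
    """Return precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Hypothetical counts for one emotion treated as the positive class.
p, r, f1, acc = metrics(tp=90, tn=880, fp=10, fn=20)
```

In a multi-class setting such as Ekman's six emotions, these counts are obtained per class by treating that class as positive and all other classes as negative.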
6. The results
The proposed pretrained language model adapted to emotion classification in Turkish achieved a success rate of 92% as measured by the Acc on TREMO (Table 2). This score is the state-of-the-art result in the literature.
Table 2. Turkish emotion classification results with the fine-tuned BERTurk [13] model.
As seen in Table 2, a success rate of 89% was achieved, even for the most unsuccessful class ‘surprise’. Furthermore, a success rate of 95% was achieved for the ‘disgust’ and ‘happy’ classes. The performances achieved in all classes are similar in both the macro- and weighted-average results. This indicates that the performance of the model is independent of the number of samples belonging to the classes.
Table 3 shows the sample numbers of all classes in the TREMO data set and the confusion matrix of the proposed model. The numbers of correctly predicted labels in the classes are quite high. The lowest performance was obtained for the ‘surprise’ class, which has the smallest number of samples. The relatively low success rate in this class is related to the number of samples, considering that the numbers of samples in the training and dev sets also have similar distributions (Table 1).
Table 3. Test set confusion matrix.
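The relation between a confusion matrix and the macro- and weighted-average scores can be sketched as follows. The three-class matrix is hypothetical and does not reproduce the values of the study; the function names are chosen for this example.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true otherwise
        denom = 2 * tp + fp + fn                   # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_and_weighted(cm):
    """Macro average weights classes equally; weighted average uses class support."""
    f1 = per_class_f1(cm)
    support = [sum(row) for row in cm]
    macro = sum(f1) / len(f1)
    weighted = sum(f * s for f, s in zip(f1, support)) / sum(support)
    return macro, weighted


# Hypothetical 3-class matrix in which the smallest class is also the weakest.
cm = [[90, 5, 5],
      [4, 45, 1],
      [6, 2, 12]]
macro, weighted = macro_and_weighted(cm)
```

In this toy case the macro average is lower than the weighted average because the smallest class has the lowest score. When the two averages are close, as reported for the proposed model, performance does not depend strongly on class size.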
Four pretrained language models were adapted to the emotion classification in Turkish. The multilingual BERT was developed for languages that have sufficient resources on Wikipedia. The resources and methods specific to Turkish were not used in the development of this model. The DistilBERT approach, which aims to reduce the size of the pretrained language model by distilling it, uses the BERT model as a source model. BERTurk, which is developed using Turkish-specific resources and tools [13], is more successful in Turkish emotion classification. DistilBERTurk is the distilled version of BERTurk. Although the size of DistilBERTurk is quite small, the model yields competitive results (Table 4).
Table 4. Turkish emotion classification results with the fine-tuned pretrained language model.
The performances of the DL methods, whose training process starts from scratch, depend on the numbers of samples and epochs. The DL methods and the DL methods weighted with FastText embeddings were compared (Table 5). Pretrained word embeddings provided very strong prior knowledge. DL with FastText achieved a success rate that was approximately 7% better than that of DL without FastText according to the results.
Table 5. Effects of word embeddings.
DL: deep learning.
The proposed pretrained emotion model is more successful than the other learning methods in [30], as seen in the comparison in Table 6. The effects of using words and lemmas as input on the performance of the SVM were also investigated. Using lemmas as input does not have a significant effect on the results. When the performances of the SVM and Bi-GRU are compared, there are no significant differences in the results. However, TML methods require more preprocessing, such as feature engineering, and the GRU with FastText performs better than the SVM.
Table 6. Proposed model versus others.
SVM: support-vector machine.
All experiments were conducted on the TEC and Blogs data sets in English to verify the created emotion classification model. The obtained results for these data sets are similar to the results for TREMO (Figure 2). As in the Turkish experiments, the TML methods performed worse than the DL methods. The performances of the DL methods increased drastically when word embeddings were transferred to them. When the transferred prior knowledge was enhanced using pretrained language models, the highest performances were obtained.

Figure 2. Emotion classification results.
The results of the experiments on the English data sets follow a similar pattern to those on TREMO (Table 7). The emotion classifications on the Turkish data set are more successful than those on the English data sets. This occurs because TREMO was created and verified manually, while the TEC and Blogs data sets were derived from social media posts and blogs, respectively. The fact that some classes have very few samples in the English data sets (Table 1) also makes this difference larger. Furthermore, according to the results on the training data sets, 30 epochs are sufficient for the DL methods to learn the Turkish emotion model but not the English model.
Table 7. Turkish and English results with the macro-average F1 metric.
SVM: support-vector machine; NB: Naive Bayes.
The proposed pretrained emotion classification model for Turkish yields more successful results than the results of other methods and previous studies. State-of-the-art results were obtained as a result of the experiments without using any attribute specific to the emotional context.
7. Conclusion
In this study, we aimed to classify emotion in Turkish texts with a pretrained language model approach. Experimental studies show that the proposed pretrained emotion model has the highest success rate compared with the methods used in previous studies. This study is noteworthy due to the unique nature of Turkish and the lack of studies on this language in this field. In addition, multilingual models and models proposed for other languages are inadequate for Turkish texts. The proposed study will enable high-level artificial intelligence tasks such as more accurate public opinion polling, customer relationship management, brand management, cyberbullying detection, the determination of election tendencies and the determination of partisan comments in Turkish media. In addition, it will be a resource for researchers in this field and can be used in Turkish text mining tasks. Future studies should aim to support the proposed models with features specific to the emotional context and thus to perform more successful emotion analysis. Future research should also aim to adapt the proposed model to other text mining subtasks such as emotion analysis, named entity recognition and question answering.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
