Incorporating emoji sentiment information into a pre-trained language model for Chinese and English sentiment analysis

Abstract

Emojis in texts provide lots of additional information in sentiment analysis. Previous implicit sentiment analysis models have primarily treated emojis as unique tokens or deleted them directly, and thus have ignored the explicit sentiment information inside emojis. Considering the different relationships between emoji descriptions and texts, we propose a pre-training Bidirectional Encoder Representations from Transformers (BERT) with emojis (BEMOJI) for Chinese and English sentiment analysis. At the pre-training stage, we pre-train BEMOJI by predicting the emoji descriptions from the corresponding texts via prompt learning. At the fine-tuning stage, we propose a fusion layer to fuse text representations and emoji descriptions into fused representations. These representations are used to predict text sentiment orientations. Experimental results show that BEMOJI gets the highest accuracy (91.41% and 93.36%), Macro-precision (91.30% and 92.85%), Macro-recall (90.66% and 93.65%) and Macro-F1-measure (90.95% and 93.15%) on the Chinese and English datasets. The performance of BEMOJI is 29.92% and 24.60% higher than emoji-based methods on average on Chinese and English datasets, respectively. Meanwhile, the performance of BEMOJI is 3.76% and 5.81% higher than transformer-based methods on average on Chinese and English datasets, respectively. The ablation study verifies that the emoji descriptions and fusion layer play a crucial role in BEMOJI. Besides, the robustness study illustrates that BEMOJI achieves comparable results with BERT on four sentiment analysis tasks without emojis, which means BEMOJI is a very robust model. Finally, the case study shows that BEMOJI can output more reasonable emojis than BERT.

Keywords

Pre-trained language model; emoji sentiment analysis; implicit sentiment analysis; prompt learning; multi-feature fusion

1. Introduction

1.1 Background and Motivation

Sentiments are subjective views, appraisals, evaluations and feelings [1]. Liu [2] first divided sentiment into explicit sentiment [3, 4] and implicit sentiment [1, 5]. Research on explicit sentiment analysis covers tasks such as personality traits detection [6, 7], emotion analysis [8, 9] and stress detection [10]. Implicit sentiment analysis studies the sentiments of language fragments, including sentences, clauses and phrases, that express subjective sentiment while the explicit sentiment clues are hidden [1, 5, 11, 12, 13]. There are few works on inferring implicit sentiment from context [5, 14], on account of implementation difficulties [5]. Specifically, Liao et al. [12] divided implicit sentiments into four categories: fact-implied type, metaphorical type, rhetorical-question type and ironic type. Existing models can hardly distinguish the sentiment orientations of language fragments in each category.

Meanwhile, existing implicit sentiment analysis (ISA) methods mainly focused on mining the inherent information of sentences, including contextual, semantic and syntactic information [12, 15], and on introducing external knowledge such as commonsense knowledge [1] and sentiment lexicons [5]. However, they ignored the explicit sentiment clues inside the language fragments, especially emojis. Since emojis carry a great deal of sentiment information, a sentence with different emojis can express different sentiments. For instance, as shown in Fig. 1, the three sentences have the same text but express three different sentiment orientations because E1 contains no emoji, E2 has a smiley face, and E3 has a scary face. Existing implicit sentiment analysis models either delete emojis directly, yielding "I am going shopping", or replace each emoji with a unique token, yielding "I am going shopping [UNKNOWN]" (UNK). Therefore, existing ISA models do not perform well when facing texts with emojis.

Figure 1.

Examples of the same text with different emojis. E1 has no emoji, and expresses a neutral sentiment; E2 has a smiley face, and expresses a positive sentiment; and E3 has a scary face, and expresses a negative sentiment.

1.2 Gap analysis and challenges

Emoji sentiment analysis methods can be divided into two categories. (1) Training emojis with texts. These methods treated each emoji as a unique special token and learned the word embedding of this token during pre-training or fine-tuning, such as DeepMoji [16] and SEntiMoji [17]. These models learned the co-occurrence relationship between text representations and emoji embeddings but failed to utilize sentiment information carried by emojis. (2) Utilizing emoji descriptions. Gavilanes et al. [18, 19] adopted emoji descriptions to complement the sentiment information of emojis. However, these methods have yet to use powerful pre-trained language representation models to obtain better text representations.

We combine the two kinds of methods above to recover the sentiment information that is missing from or ignored in texts. There are two main challenges: (1) Emoji description encoding. For a given text, it is vital to effectively extract and encode the information of the emoji descriptions in the text. (2) Multi-feature fusion. Since emoji description representations and text representations belong to two different vector spaces, how to fuse these two individual vectors is another challenge.

1.3 Solutions and contributions

To overcome the above-mentioned challenges, we propose a pre-training Bidirectional Encoder Representations from Transformers (BERT) model with emojis for sentiment analysis (BEMOJI). This method utilizes the sentiment information inside emojis to supply the missing or ignored sentiment clues and to analyze implicit sentiment. It pre-trains a language representation model on texts with emojis and fine-tunes it on downstream tasks by fusing emoji description representations and text representations.

Pre-training. At the pre-training stage, we adopt two strategies so that BEMOJI learns the interdependence between tokens and the relationship between text and emoji sentiment information. For the first strategy, we utilize masked language modeling (MLM) as the first pre-training objective. We randomly mask some tokens in the given sentence and let the pre-trained language model predict the masked tokens. For the second strategy, we adopt prompt learning and let our model predict the most suitable emoji description for the given text. Since there are associations between an input text and its emojis, we use prompt learning to force the model to learn them. We design a prompt template as $[Z][X]$, where $[X]$ is an input text, and $[Z]$ is the answer slot that will be filled with the emoji description.

Fine-tuning. At the fine-tuning stage, we utilize two individual BERTs to encode emoji descriptions and texts. Then we use one fusing layer to map these two different representations into one vector space. At last, we feed the fused representation into one classifier and obtain the predicted sentiment label.

The main contributions of our work are expressed as follows: (1) Prompt learning is adopted to predict the corresponding emoji descriptions of input sentences at the pre-training stage. (2) A knowledge fusion layer is adopted to combine the emoji descriptions and input sentence representations to predict the sentiment of the input texts at the fine-tuning stage. (3) A pre-training BERT model with emojis for sentiment analysis (BEMOJI) is proposed. (4) Abundant experiments verify that BEMOJI outperforms existing emoji-based methods and pre-trained language models on Chinese and English datasets.

2. Related work

2.1 Implicit sentiment analysis

Existing ISA methods mainly focused on mining contextual, syntactic and semantic information or injecting external knowledge. Liao et al. [12] used the multi-level semantic fusion model to capture explicit context representation, implicit opinion expression and implicit target representation from semantic dependency trees. Compared with other baseline models, their method achieved the best scores on implicit sentiment analysis and unsentimental sentence classification tasks. Zhuang et al. [15] obtained contextual, syntactic and semantic features from sentence embedding, dependency syntax tree and sentiment words of a sentence, respectively. The multi-feature neural network outperformed other RNN-based models and syntactic tree-based models. CLEAN [20] eliminated the confounding causal effects and extracted the pure causal effect between sentence and sentiment. The experimental results showed that CLEAN achieved state-of-the-art (SOTA) results on both implicit sentiment analysis and implicit aspect sentiment analysis tasks. ContextBERT [21] utilized contextual information and sentence information to analyze implicit sentiment. The experimental results showed that it achieved SOTA results on two implicit sentiment analysis tasks. Since the clues of implicit sentiment may not be obtained directly from a sentence, it is difficult to accurately identify implicit sentiment only by considering it without external knowledge.

In order to complement sentiment clues in the texts, some researchers introduce external knowledge into ISA models. Wei et al. [5] adopted BiLSTM to encode sentences for sentence representations and put them together with the words in a sentiment lexicon into the orthogonal attention layer for implicit sentiment classification. Their proposed method achieved the best scores on the implicit sentiment analysis dataset, i.e., SMP2019-ECISA, and explicit sentiment analysis datasets, including COAE and SemEval, compared to other baseline models. Zhou et al. [22] introduced an event triplet into a sentence and let the model classify event and sentiment. They constructed an event-related implicit sentiment analysis dataset, i.e., EveSA. They also used BERT to encode the event subject, event predicate, event object and sentence. Their method has two training objectives: event classification and sentiment classification. The experimental results showed that their method achieved SOTA scores on both EveSA and SemEval17 Task4 datasets. Liao et al. [1] introduced commonsense knowledge into sentences. They used TransE [23] and GAT [24] to represent commonsense knowledge and pre-trained word embedding to represent a sentence. Then they adopted BiLSTM to encode the sentence representation and knowledge representation. Third, they utilized an orthogonal attention layer to fuse a sentiment lexicon and the output of BiLSTM. At last, they used one classifier to output the predicted label. The experimental results showed that the proposed B+KG-MPOA method outperformed other implicit sentiment analysis methods. KC-ISA [25] fused context information, target sentence information and knowledge graph information to analyze implicit sentiment. The experimental results showed that KC-ISA achieved SOTA results on implicit sentiment analysis and two other sentiment classification tasks. Lin et al. [26, 64] proposed a worthwhile dynamic multiscale topological representation in representation learning. Their method significantly improved classification performance on network traffic datasets. Liu et al. [27] proposed a new cross domain sentiment aware word embedding model. Their model integrated sentiment information and domain relevance. Compared with other baseline models, their model improved the performance of comment sentiment classification.

Although these methods mentioned above could effectively utilize contextual, syntactic, semantic information and external knowledge, they treated emojis as a special token, like UNK, and ignored these explicit sentiment clues. Therefore, it is necessary to thoroughly consider contextual information, external knowledge and emojis in implicit sentiment analysis models.

2.2 Emoji sentiment analysis

The existing emoji sentiment analysis methods are divided into two categories: (1) the methods that treat emojis as unique tokens and pre-train or fine-tune their embeddings; and (2) the methods that consider the descriptions of emojis. DeepMoji [16] used a large corpus (which contains sentences and emojis) to pre-train the model by two BiLSTM layers and one attention layer. Then DeepMoji transferred the pre-trained model to the sentiment analysis task without emojis. The experimental results showed that DeepMoji could learn a lot of sentiment information from emojis and achieve SOTA results on three sentiment analysis datasets, three emotion analysis datasets and two sarcasm classification datasets. SEntiMoji [17] evaluated the performance of DeepMoji [16] in the software engineering field. The experimental results showed that SEntiMoji achieved the best results on four software engineering datasets. Based on DeepMoji [16], Li et al. [28] also considered the sentiment polarities of emojis. Their proposed EAGRU method replaced BiLSTM in DeepMoji with BiGRU. Besides, they introduced two hyperparameters as weights for the sentiment lexicon and the hidden states of DeepMoji, respectively. The experimental results showed that EAGRU achieved the best results on the Weibo dataset compared with other baseline models. Chen et al. [29, 30, 31] assigned two opposite embeddings, i.e., positive embedding and negative embedding, to each emoji and concatenated the word and emoji embeddings to create the final embedding. The experimental results showed that their methods achieved the best results on Twitter sentiment analysis datasets compared with other baseline models. Al-Halah et al. [32] used many images to train emoji embeddings. They extracted images and their corresponding emojis and let the proposed SmileyNet train the relationship between images and emojis. Then they used the pre-trained SmileyNet to fine-tune it in five different downstream tasks. 
The experimental results showed that SmileyNet achieved many SOTA scores on the Twitter Visual Sentiment dataset [33]. Laurenceau et al. [34] studied the influence of different skin tones of emojis on sentiment analysis. EmoGraph2vec [35] adopted a graph neural network to embed emojis. Although these methods achieved remarkable results in their tasks, they have yet to fully exploit the sentiment information of emojis.

Among the methods that consider emoji descriptions, Gavilanes et al. [18, 19] introduced the descriptions of emojis into a syntactic tree. They modeled the syntactic tree and emoji description. Their method utilized the sentiment information of emojis effectively. The experimental results showed that their methods outperformed rule-based methods on datasets in 13 languages. However, these methods have yet to adopt deep learning, especially pre-trained language models, to obtain better language representations. Therefore, it is crucial to fully utilize powerful pre-trained language models as the backbones and the sentiment information inside the emoji descriptions as the models' sentiment enhancement information.

2.3 Prompt learning

The pre-trained language models, such as BERT [36] and RoBERTa [37], have shown their power in many natural language processing tasks. However, they needed to introduce additional parameters and fine-tune them to satisfy downstream tasks [38]. With the help of prompt learning, these pre-trained models could predict the desired output without additional task-specific parameters or training [38]. LAMA [39] found that the SOTA models using prompt learning, without fine-tuning, could also provide relational knowledge. GLM [40] used the autoregressive generation in the pre-training stage to generate the masked spans and fine-tuned it with prompt learning for downstream tasks. The experimental results showed that GLM achieved many SOTA scores on the SuperGLUE dataset [41]. Schick and Schutze [42] manually designed many prompt templates for different downstream tasks. Their proposed PET method achieved SOTA results on four datasets with few-shot learning and zero-shot learning. Jiang et al. [43] improved BERT’s performance by manually designing and automatically searching for prompt templates. Their experiments showed that different prompt templates would get different results, and their proposed automatic searching method achieved SOTA results on seven datasets. NSP-BERT [44] solved the problem that prompt learning will map the tokens into a fixed length. It adopted next sentence prediction (NSP) as the pre-training objective of prompt learning and mapped tokens to a variable length space. The experimental results showed that NSP-BERT outperformed other manual prompt learning methods. ConnPrompt [45] adopted prompt learning to let the pre-trained language model learn the conjunctions between contexts. It achieved SOTA results compared to other pre-trained language models. Thus, prompt learning can exploit the language knowledge learned by the pre-trained language models.

3. Research objectives

In this section, we formally introduce pre-training and fine-tuning objectives.

3.1 Pre-training objectives

There are two pre-training objectives: (1) learning the interdependence between different tokens from the input text; (2) inferring the relationships between the input text and the emoji sentiment information. The former aims to obtain text sentiment information based on the words, while the latter aims to obtain emoji sentiment information based on the text and emojis for a given sentence with emojis. In order to achieve these two objectives, the specific methods are as follows.

For the first pre-training objective, given an input text $\{w_{1},w_{2},\ldots,w_{n}\}$ (where $n$ is the sequence length), we expect that a sentiment analysis model can learn the interdependence between every token. Specifically, we expect that the sentiment analysis model can infer the word $w_{i}$ according to its context. Formally, we expect the model to learn the highest ${\rm P}(w_{i}|w_{1},\ldots,w_{i-1},w_{i+1},\ldots,w_{n})$ . The formula is as follows.

$\displaystyle L_{\textit{token}}=-{\rm log}({\rm P}(w_{i}|w_{1},\ldots,w_{i-1},w_{i+1},\ldots,w_{n})),$ (1)

where $L_{\textit{token}}$ is the loss of the first pre-training objective.

For the second pre-training objective, previous researchers [16, 17, 28] adopted the input text $\{w_{1},w_{2},\ldots,w_{n}\}$ to predict the corresponding emoji $\bm{e}$ . The formula is as follows.

$\displaystyle{L_{\textit{sec}}}=-{\rm log}({\rm P}({\bm{e}}|w_{1},w_{2},\ldots,w_{n})),$ (2)

where $L_{\textit{sec}}$ is the loss. However, this method cannot fully utilize the sentiment information inside the emojis. A single token $\bm{e}$ contains limited information, so predicting only $\bm{e}$ lets the model learn the relationship between $\{w_{1},w_{2},\ldots,w_{n}\}$ and $\bm{e}$ without utilizing the hidden emoji information behind $\bm{e}$. Here, the relationship refers to the correlation of each word in the sentence $\{w_{1},w_{2},\ldots,w_{n}\}$ with the emoji $\bm{e}$, and the hidden emoji information is the emotional information conveyed by the emoji description $e$.

In order to solve this problem, we expect the sentiment model to learn the relationship between the input text and the sentiment information inside the corresponding emoji. Specifically, we adopt BEMOJI to predict the emoji description $e=\{e_{1},e_{2},\ldots,e_{m}\}$ (where $m$ is the sequence length of emoji description) via the input text $\{w_{1},w_{2},\ldots,w_{n}\}$ . The emoji description $e$ is the description text corresponding to the emoji $\bm{e}$ , which is used to explain the kind of emotion the emoji expresses. The formula is as follows.

$\displaystyle L_{\textit{emoji}}=-{\rm log}({\rm P}(e_{1},e_{2},\ldots,e_{m}|w_{1},w_{2},\ldots,w_{n})),$ (3)

where $L_{\textit{emoji}}$ is the loss of the second pre-training objective. Then, the model can learn the interdependence between the input text and the sentiment information inside the corresponding emoji from this pre-training objective.

3.2 Fine-tuning objective

After pre-training, we need a well-trained sentiment analysis model to predict the sentiment polarity of a sentence with emojis. Previous researchers either adopt the well-trained language model to predict the sentiment label $y$ from the input text $w=\{w_{1},w_{2},\ldots,w_{n}\}$ alone [16, 17, 28], as in the following formula,

$\displaystyle L_{ft}=-{\rm log}({\rm P}(y|w)),$ (4)

or they predict the sentiment label $y$ by the input text and the well-trained emoji embedding $\bm{e}$ [29], like the following formula.

$\displaystyle L_{ft}=-{\rm log}({\rm P}(y|w;\bm{e})),$ (5)

where $L_{ft}$ is the loss of fine-tuning. However, these methods only partially utilize the sentiment information inside the emojis.

In our fine-tuning objective, given the input text $w=\{w_{1},w_{2},\ldots,w_{n}\}$, its corresponding emoji description $e=\{e_{1},e_{2},\ldots,e_{m}\}$ and its sentiment label $y$, we expect the sentiment analysis model to classify the sentiment orientation from both the text and the emoji description. The sentiment analysis model is updated by minimizing the following objective:

$\displaystyle L_{ft}=-{\rm log}({\rm P}(y|w;e)).$ (6)

4. Pre-training BERT model with emojis for sentiment analysis

To achieve the above research objectives in Section 3, we propose a pre-training BERT model with emojis for sentiment analysis (BEMOJI). At the pre-training stage, we pre-train BEMOJI by predicting the emoji descriptions from the corresponding texts via prompt learning. At the fine-tuning stage, we propose a fusion layer to fuse text representations and emoji descriptions into fused representations. These representations are used to predict text sentiment orientations.

4.1 Pre-training BEMOJI

The architecture of pre-training BEMOJI is shown in Fig. 2. We use two individual BERTs [36] as the backbone. The BERT that processes emoji descriptions is denoted as Emoji BERT (EBERT), and the BERT that handles texts is denoted as Contextual BERT (CBERT).

Figure 2.

The BEMOJI pre-training framework.

There are two tasks when we pre-train BEMOJI: (1) learning interdependence between different tokens from input texts; and (2) inferring the relationships between texts and emoji sentiment information.

4.1.1 Learning interdependence between different tokens from input texts

We use traditional MLM to learn the interdependence between different tokens. Specifically, 15% of the tokens in each text are selected, and each selected token is handled as follows: (1) 80% of the time, it is replaced with [MASK]; (2) 10% of the time, it is replaced with another token in $V$ (where $V$ is the vocabulary); and (3) 10% of the time, it is left unchanged. Then we use CBERT to predict the masked tokens.
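As an illustration, the masking rule above can be sketched in Python. This is a minimal sketch, not the implementation used for BEMOJI: the function name `mask_tokens`, the per-token random selection (instead of selecting exactly 15% of the positions) and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 masking rule.

    Returns (corrupted, labels); labels[i] is the original token at a
    selected position and None elsewhere.
    Assumption: each token is selected independently with select_prob.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected, nothing to predict here
        labels[i] = token
        draw = rng.random()
        if draw < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif draw < 0.9:
            # 10%: replace with a different token from the vocabulary
            corrupted[i] = rng.choice([v for v in vocab if v != token])
        # remaining 10%: keep the token unchanged
    return corrupted, labels
```

A seeded `random.Random` instance can be passed as `rng` to make the corruption reproducible.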

For the input sentence $\{w_{1},w_{2},\ldots,w_{n}\}(w_{i}\in V)$ , CBERT will encode it into the hidden states, and its token representation, denoted by $\bm{h}_{t}$ , is as follows.

$\displaystyle\bm{h}_{t}=\{\bm{h}_{1},\bm{h}_{2},\ldots,\bm{h}_{n}\}={\rm embed}(\{w_{1},w_{2},\ldots,w_{n}\}),$ (7)

where $\bm{h}_{t}\in\mathbb{R}^{n\times d_{b}}$ ( $d_{b}$ is the hidden size of CBERT) is a two-dimensional matrix.

When we adopt CBERT to represent the input sentence, we feed the hidden state of [MASK] into a fully connected layer that maps it to a vector of dimension $|V|$, i.e., one score per vocabulary token. Moreover, we extract the index corresponding to the largest value, i.e., the predicted token $\hat{y}_{t}$.

$\displaystyle\hat{y}_{t}={\rm argmax}(\bm{W}^{O}\bm{h}_{m}^{\rm T}+\bm{b}^{O}),$ (8)

where ${\rm argmax}(\cdot)$ is the function that extracts the index of the largest value, $\bm{W}^{O}\in\mathbb{R}^{|V|\times d_{b}}$ is the weight matrix, $\bm{b}^{O}$ is the bias, and $\bm{h}_{m}\in\bm{h}_{t}$ is the hidden state of [MASK]. Finally, we use the cross-entropy loss to calculate the loss between the prediction $\hat{y}_{t}$ and the true token $y_{t}$. The cross-entropy loss is as follows.

$\displaystyle L_{CE}=-\frac{1}{|V|}\sum_{i=1}^{|V|}\left(y_{t_{i}}{\rm log}(\hat{y}_{t_{i}})+(1-y_{t_{i}}){\rm log}(1-\hat{y}_{t_{i}})\right),$ (9)

where $|V|$ is the number of words in the vocabulary, and $L_{CE}$ is the cross-entropy loss of MLM.
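To make Eqs. (8) and (9) concrete, the following sketch computes the predicted token index and the loss for a toy vocabulary of three tokens. It assumes that the scores of the fully connected layer are turned into probabilities with a softmax and that the target is one-hot; neither detail is stated explicitly in the text, and the small clamping constant `eps` is added only to keep the logarithms finite.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_token(logits):
    """Eq. (8): index of the largest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def mlm_cross_entropy(logits, true_index, eps=1e-12):
    """Eq. (9) with a one-hot target over the vocabulary.

    Assumption: softmax probabilities, clamped to [eps, 1 - eps].
    """
    probs = softmax(logits)
    total = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)
        y = 1.0 if i == true_index else 0.0
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(probs)
```

With scores `[2.0, 0.5, -1.0]`, the predicted index is 0, and the loss is smaller when the true token is index 0 than when it is index 2.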

4.1.2 Inferring the relationships between texts and emoji sentiment information

We use prompt learning to make BEMOJI learn the relationship between emoji sentiment information and texts. We first convert each text with multiple emojis into several texts with one emoji each. For instance, “I am going shopping [grin][wink].” will be separated into “I am going shopping [grin].” and “I am going shopping [wink].”. Then, we prepend a [MASK] token to every sentence; it stands for the emoji token and is used to predict the emoji description. Let the input sentence be $[X]$. Then the input sentence with the prompt template is $\textit{[MASK]}[X]$. The answer space $[Z]$ contains all the emoji descriptions. Moreover, the true label $y$ is the emoji description corresponding to the sentence. For example, the sentence “I am going shopping [grin].” with the prompt template and the answer space is shown in Table 1.

Table 1
The details of an example: “I am going shopping [grin].”

Symbol	Example	Description
$[X]$	I am going shopping.	The input sentence.
$\textit{[MASK]}[X]$	[MASK] I am going shopping.	The input sentence with prompt template.
$[Z]$	[grin]: Often conveys general happiness and …	The answer space.
	[blue heart]: A blue heart emoji …
	[clapping hands]: Two hands clapping emoji …
$y$	[grin]: Often conveys general happiness and …	True label.
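The construction of the prompt inputs summarized in Table 1 can be sketched as follows. The sketch assumes that emojis appear in the text as bracketed names such as `[grin]` and that every bracketed span is an emoji; the function name `build_prompts` and the placeholder descriptions are hypothetical and stand in for the real answer space.

```python
import re

# Assumption: emojis are written as bracketed names, e.g. "[grin]".
EMOJI_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def build_prompts(text, descriptions):
    """Split a text with several emojis into one (prompt, label) pair per emoji.

    prompt is "[MASK] " followed by the text without emojis;
    label is the description of that emoji (the true label y).
    """
    names = EMOJI_PATTERN.findall(text)
    plain = EMOJI_PATTERN.sub("", text)
    # Remove the space left in front of punctuation after deleting emojis.
    plain = re.sub(r"\s+([.!?,])", r"\1", plain).strip()
    return [("[MASK] " + plain, descriptions[name])
            for name in names if name in descriptions]

# Placeholder descriptions, not the real entries of the answer space.
DESCRIPTIONS = {
    "grin": "placeholder description of grin",
    "wink": "placeholder description of wink",
}
pairs = build_prompts("I am going shopping [grin][wink].", DESCRIPTIONS)
```

Here `pairs` holds two training instances that share the prompt "[MASK] I am going shopping." and differ only in the target description.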

Let the token sequence of an input sentence be $\{w_{1},w_{2},\ldots,w_{n}\}$ ( $w_{i}\in V$ ), where the second token $w_{2}$ is the masked token of prompt learning (since the first token is [CLS]), and let the BERT processing texts be CBERT. The hidden state of $w_{2}$ , denoted by $\bm{h}_{2}^{t}$ , is calculated as follows.

$\displaystyle\bm{h}_{2}^{t}={\rm CBERT}(w_{2}).$ (10)

Let the BERT processing emoji descriptions be ${\rm EBERT}$, and let the token sequence of the true emoji description be $\{e_{1},e_{2},\ldots,e_{m}\}$, where $e_{i}$ is a token of the emoji description. Then, the hidden states of all tokens are calculated as follows.

$\displaystyle\{\bm{h}_{1}^{e},\bm{h}_{2}^{e},\ldots,\bm{h}_{m}^{e}\}={\rm EBERT}(\{e_{1},e_{2},\ldots,e_{m}\}),$ (11)

where $\bm{h}_{i}^{e}$ is the hidden state of the token $e_{i}$ . Then, the sentence representation is the hidden state of the token [CLS], denoted by $\bm{h}^{e}_{S}$ . Thus, we have

$\displaystyle\bm{h}^{e}_{S}=\bm{h}_{1}^{e}.$ (12)

Inspired by BYOL [46], we use the hidden state of the masked token to predict the sentence representations of emoji descriptions, which enables the model to learn the relationships between the sentence and the sentiment information of emoji descriptions. We adopt mean square error, denoted by $L_{\textit{MSE}}$ , as the loss function.

$\displaystyle L_{\textit{MSE}}=(\bm{h}_{2}^{t}-\bm{h}^{e}_{S})^{2}.$ (13)
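A small numerical sketch of Eq. (13) follows. Since the equation does not state how the squared differences are reduced to a scalar, the sketch averages them over the hidden dimensions; this reduction and the function name `mse_loss` are assumptions.

```python
def mse_loss(h_mask, h_desc):
    """Mean squared error between the [MASK] hidden state and the
    [CLS] representation of the emoji description (Eq. 13).

    Assumption: squared differences are averaged over the hidden dimensions.
    """
    if len(h_mask) != len(h_desc):
        raise ValueError("both vectors must have the same hidden size")
    return sum((a - b) ** 2 for a, b in zip(h_mask, h_desc)) / len(h_mask)
```

The loss is zero only when the two vectors coincide, so minimizing it pulls the hidden state of the masked token toward the representation of the true emoji description.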

In order to balance these two losses, we adopt CoV-Weighting [47] to automatically obtain the weight of each loss. Specifically, CoV-Weighting utilizes the mean and standard deviation of each loss to calculate its coefficient of variation (CoV). The loss weights are then dynamically adjusted based on these statistics. This approach allows the loss weights to change adaptively during training without the need for manual adjustments or additional optimization processes. The total loss, denoted as $L_{pt}$, is calculated as follows.

$L_{pt}={\rm CovWeighting}(L_{\rm MSE},L_{\rm CE}),$ (14)

where ${\rm CovWeighting}(\cdot)$ can automatically assign weights to each loss in each iteration.
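The idea behind Eq. (14) can be sketched as below. This is a simplified illustration under stated assumptions, not the exact procedure of [47]: the statistics are computed over the full history of raw loss values, the weights are the coefficients of variation normalized to sum to one, and uniform weights are used until two observations are available. The class name `CoVWeighter` is hypothetical, and [47] should be consulted for how the statistics are actually accumulated.

```python
import statistics

class CoVWeighter:
    """Weights each loss by its coefficient of variation (std / mean)."""

    def __init__(self, num_losses):
        self.history = [[] for _ in range(num_losses)]

    def combine(self, losses):
        """Record the new loss values and return (weighted total, weights)."""
        for past, value in zip(self.history, losses):
            past.append(float(value))
        uniform = [1.0 / len(losses)] * len(losses)
        if len(self.history[0]) < 2:
            weights = uniform  # too few observations for a standard deviation
        else:
            covs = []
            for past in self.history:
                mean = statistics.fmean(past)
                covs.append(statistics.pstdev(past) / mean if mean > 0 else 0.0)
            norm = sum(covs)
            weights = [c / norm for c in covs] if norm > 0 else uniform
        total = sum(w * l for w, l in zip(weights, losses))
        return total, weights
```

In this sketch a loss whose value still fluctuates strongly relative to its mean receives a larger weight, while a loss that has stopped changing receives a smaller one.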

4.2 Fine-tuning BEMOJI on downstream tasks

Figure 3.

BEMOJI fine-tuning framework.

The BEMOJI fine-tuning architecture on downstream tasks is shown in Fig. 3. Each token sequence $\{w_{1},w_{2},\ldots,w_{n}\}$ with sentiment label $y_{s}$ contains $k$ emojis ($k\geq 1$), and the value of $k$ varies across token sequences. Since finding each emoji’s place in each text is a time-consuming task [48, 49], inspired by KEPLER [49], we encode texts and emoji descriptions separately. We use pre-trained ${\rm CBERT}$ and ${\rm EBERT}$ to obtain the representations of the token sequence and emoji description sequence, respectively. Let $\bm{h}^{t}$ be the sentence representation, $\bm{h}^{e}_{\textit{cls}_{i}}$ be the emoji description representation of the $i$-th emoji of the sentence, and $\{e_{1}^{i},e_{2}^{i},\ldots,e_{m}^{i}\}$ be the token sequence of the $i$-th emoji description of the sentence ($i\in[1,k]$). Then these representations are calculated as follows.

$\displaystyle\begin{split}&\displaystyle\bm{h}^{t}={\rm CBERT}(\{w_{1},w_{2},\ldots,w_{n}\}),\\ &\displaystyle\bm{h}^{e}_{\textit{cls}_{i}}={\rm EBERT}(\{e_{1}^{i},e_{2}^{i},\ldots,e_{m}^{i}\}).\end{split}$ (15)

Then we use max pooling to obtain the total emoji description representation of these $k$ emoji description representations, denoted by $\bm{h}^{e}$ , as follows.

$\displaystyle\bm{h}^{e}={\rm Max}(\{\bm{h}^{e}_{\textit{cls}_{1}},\ldots,\bm{h}^{e}_{\textit{cls}_{k}}\}),$ (16)

where ${\rm Max}(\cdot)$ is the max pooling operation. Since the sentence representation and emoji description representation belong to two different vector spaces, we use one fusion layer to map them into one vector space. Then, the hidden state integrated with the sentence and emoji description representation, denoted by $\bm{h}^{f}$, is as follows.

$\displaystyle\bm{h}^{f}=\sigma(\bm{W}^{t}\bm{h}^{t}+\bm{W}^{e}\bm{h}^{e}+\bm{b}^{f}),$ (17)

where $\sigma(\cdot)$ is the GELU function [50], $\bm{W}^{t}$ and $\bm{W}^{e}$ are the weight matrices of the sentence and emoji description representations, respectively, and $\bm{b}^{f}$ is the bias. Because each weight matrix is a linear map, training can learn to project $\bm{h}^{t}$ and $\bm{h}^{e}$ into the same vector space, which allows them to be added together. Finally, we use one fully connected layer as the output layer, denoted as follows.

$\displaystyle\hat{y}_{s}={\rm Softmax}(\bm{W}^{o}\bm{h}^{f}+\bm{b}^{o}),$ (18)

where $\hat{y}_{s}$ is the predicted sentiment label, $\bm{W}^{o}$ and $\bm{b}^{o}$ are the weight matrix and bias in the output layer. After obtaining the predicted label, we can fine-tune BEMOJI with cross-entropy loss. The cross-entropy loss function of fine-tuning, denoted by $L_{ft}$ , is as follows.

$\displaystyle L_{ft}=-\frac{1}{N_{s}}\sum_{i=1}^{N_{s}}(y_{s_{i}}{\rm log}(\hat{y}_{s_{i}})+(1-y_{s_{i}}){\rm log}(1-\hat{y}_{s_{i}})),$ (19)

where $N_{s}$ is the number of sentiment labels.
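The forward pass of Eqs. (16) to (18) can be sketched with plain Python lists. The sketch takes the outputs of CBERT and EBERT as given vectors, applies element-wise max pooling over the emoji description representations, and uses the exact (erf-based) form of GELU; the helper names and the toy dimensions are assumptions for illustration.

```python
import math

def max_pool(vectors):
    """Eq. (16): element-wise maximum over the k emoji description vectors."""
    return [max(column) for column in zip(*vectors)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(h_t, emoji_vectors, W_t, W_e, b_f, W_o, b_o):
    """Eqs. (16)-(18): pool, fuse, then classify."""
    h_e = max_pool(emoji_vectors)
    h_f = [gelu(a + b + c)
           for a, b, c in zip(matvec(W_t, h_t), matvec(W_e, h_e), b_f)]
    logits = [a + b for a, b in zip(matvec(W_o, h_f), b_o)]
    return softmax(logits)
```

The returned list holds one probability per sentiment label; the predicted label is the index of the largest entry.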

5. Experiment

In this section, we present the pre-training and fine-tuning details of BEMOJI and show experimental results on both the Chinese and English datasets.

5.1 Datasets and implementation

In order to prove that BEMOJI is applicable in different languages, we choose Chinese and English datasets as experimental datasets.

5.1.1 Pre-training datasets

Since there is no suitable Chinese corpus with emojis, we obtain Chinese data with emojis from Weibo. We collect a total of 39,349 posts with emojis. After splitting these posts so that each sentence corresponds to one emoji, we obtain 42,073 instances. As for the English dataset, we use the GitHub data [17] as our pre-training dataset. We extract 5,000 instances as the experimental dataset, and the remaining 1,180,401 instances are used as pre-training data.

Because of the enormous cost of training BEMOJI from scratch and the small size of the pre-training data, we adopt bert-base-uncased [36] as the starting point of BEMOJI on the English dataset and bert-base-chinese [51] as the starting point of BEMOJI on the Chinese dataset.

5.1.2 Fine-tuning datasets

We manually annotate two emoji sentiment analysis datasets, one Chinese and one English, as fine-tuning datasets. In order to satisfy the independent and identically distributed (i.i.d.) assumption, we re-crawl 11,636 Weibo posts that differ from the Chinese pre-training dataset. Moreover, we extract 5,000 GitHub instances for the English dataset, of which we label 2,767. The details of these datasets (training, development and test splits for each language) are shown in Table 2.

Table 2
The details of fine-tuning datasets

		Train	Dev	Test
Weibo	Positive	3665	458	459
	Negative	5643	706	705
Github	Positive	1306	163	164
	Negative	907	114	113

5.1.3 Obtaining emoji descriptions

We select the Weibo emojis that appear more than 100 times, which yields 64 emojis for the Chinese dataset. For the English dataset, there are also 64 emojis [17]. Since Weibo emojis look different from GitHub emojis, we convert Weibo emojis into GitHub emojis.$^{4}$ Moreover, we obtain Chinese emoji descriptions from EMOJIALL$^{5}$ and English emoji descriptions from Emojipedia.$^{6}$ From each description, we remove the text unrelated to sentiment and retain the text that carries sentiment information. Figure 4 shows three emoji descriptions in both languages.
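
The frequency-based selection of emojis can be sketched as follows. This is only an illustration of the "more than 100 occurrences" rule; the function name and the bracketed emoji tokens are hypothetical and do not come from the released code.

```python
from collections import Counter


def frequent_emojis(emoji_occurrences, min_count=100):
    """Return the emojis that occur more than `min_count` times, most frequent first."""
    counts = Counter(emoji_occurrences)
    return [emoji for emoji, count in counts.most_common() if count > min_count]


# Hypothetical input: one list entry per emoji occurrence in the crawled corpus.
occurrences = ["[doge]"] * 150 + ["[cry]"] * 101 + ["[rare]"] * 100
selected = frequent_emojis(occurrences)
```

Note that the comparison is strict, so an emoji that appears exactly 100 times is excluded, in line with the wording "more than 100 times".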

5.1.4 Knowledge fusion

Since the textual representation $\bm{h}_{t}$ and the emoji description representation $\bm{h}^{e}$ belong to different vector spaces, a fully connected layer is used to map the two different representations into the same vector space. Specifically, the study utilizes Word2Vec to represent the input sentences $\{w_{1},w_{2},\ldots,w_{n}\}(w_{i}\in V)$ and the predicted emoji descriptions. Then, it calculates the cosine similarity between the sentence representations $\bm{h}_{t}$ and the emoji description representations $\bm{h}^{e}$ . In addition, average pooling is used to obtain sentence representations.

5.1.5 Implementation

We use the AdamW optimizer [52] with $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$ and a weight decay of $10^{-4}$, together with a cosine warm-up of the learning rate whose peak value is $10^{-6}$. The settings of the other hyperparameters are shown in Table 3. All experiments are run on one NVIDIA A40 (48G) GPU.
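
The exact shape of the learning-rate schedule is not spelled out here. One common reading of a cosine schedule with warm-up is a linear ramp to the peak followed by cosine decay, and the sketch below assumes that reading. The values of `warmup_steps` and `total_steps` are hypothetical; only the peak learning rate comes from the text.

```python
import math

PEAK_LR = 1e-6  # peak learning rate stated in the text


def learning_rate(step, warmup_steps, total_steps, peak_lr=PEAK_LR):
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1000 optimizer steps with 100 warm-up steps.
schedule = [learning_rate(s, warmup_steps=100, total_steps=1000) for s in range(1001)]
```

In a training loop, the value returned for the current step would be assigned to the optimizer before each update.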

Table 3
The settings of hyperparameters. $n$ stands for the input sequence length, while $m$ represents the emoji description length

	Hyperparameter	Chinese dataset	English dataset
Pre-training	$n$	64	96
	$m$	32	90
	Batch size	128	150
	Epochs	10	10
	Hidden size	768	768
Fine-tuning	$n$	64	128
	$m$	32	90
	Batch size	64	32
	Epochs	100	100
	Hidden size	768	768

Figure 4.

Examples of three emoji descriptions in two languages.

5.1.6 Evaluation metrics

We use accuracy (Acc), macro-precision (Macro-P), macro-recall (Macro-R) and macro-F1-measure (Macro-F1) as models’ evaluation metrics. Let $TP_{i}$ be the number of samples whose true labels and predicted labels are both label $i$ ; $TN_{i}$ be the number of samples whose true labels and predicted labels are both label $j(j\neq i)$ ; $FN_{i}$ be the number of samples whose true labels are the label $i$ , and the predicted labels are the label $j$ ; and $FP_{i}$ be the number of samples whose true labels are the label $j$ , while the predicted labels are the label $i$ . Then the Macro-P, Macro-R and Macro-F1 scores can be calculated as follows.

$\displaystyle\textit{Macro-P}=\frac{1}{k}\sum_{i=1}^{k}P_{i},\quad\textit{Macro-R}=\frac{1}{k}\sum_{i=1}^{k}R_{i},\quad\textit{Macro-F1}=\frac{1}{k}\sum_{i=1}^{k}F1_{i},$ (20)

where

$\displaystyle P_{i}=\frac{TP_{i}}{TP_{i}+FP_{i}},\quad R_{i}=\frac{TP_{i}}{TP_{i}+FN_{i}},\quad F1_{i}=\frac{2P_{i}R_{i}}{P_{i}+R_{i}},$ (21)

and $k$ is the number of labels.

As for the accuracy score, let the number of samples whose predicted labels are equal to true labels be $TL$, and let the total number of samples be $N$. Then, the accuracy score can be calculated as follows.

$\displaystyle\textit{acc}=\frac{TL}{N}.$ (22)
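
The metrics in Eqs. (20)-(22) can be sketched in a few lines of Python. The function name and the rule that a ratio with a zero denominator counts as 0 are assumptions made for illustration. The toy input uses the class sizes of the Weibo test split in Table 2 together with a hypothetical classifier that always predicts the majority class.

```python
def macro_metrics(y_true, y_pred):
    """Return (accuracy, Macro-P, Macro-R, Macro-F1) following Eqs. (20)-(22)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumed convention: a ratio with a zero denominator is counted as 0.
        p_i = tp / (tp + fp) if tp + fp else 0.0
        r_i = tp / (tp + fn) if tp + fn else 0.0
        f1_i = 2 * p_i * r_i / (p_i + r_i) if p_i + r_i else 0.0
        precisions.append(p_i)
        recalls.append(r_i)
        f1_scores.append(f1_i)
    k = len(labels)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1_scores) / k


# Weibo test split in Table 2: 459 positive and 705 negative samples.
y_true = ["pos"] * 459 + ["neg"] * 705
# Hypothetical degenerate classifier that always predicts the majority class.
y_pred = ["neg"] * len(y_true)
acc, macro_p, macro_r, macro_f1 = macro_metrics(y_true, y_pred)
print(f"{acc:.2%} {macro_p:.2%} {macro_r:.2%} {macro_f1:.2%}")
```

On this input the sketch gives 60.57%, 30.28%, 50.00% and 37.72%, which shows how macro-averaging penalizes a classifier that ignores one class even when its accuracy looks moderate. These values coincide with the DeepMoji row of Table 4, which would be consistent with a model that predicts a single class for every test sample.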

5.2 Fine-tuning results

We choose emoji-based methods, including DeepMoji [16], WATT-BiE-LSTM [29] and MATT-BiE-LSTM [29], and other SOTA transformer-based models as our baseline methods. For DeepMoji, WATT-BiE-LSTM and MATT-BiE-LSTM, we use 300-dimensional Word2Vec as the word embedding. For other transformer-based models, we use the pre-trained parameters from Hugging Face as their start points.

DeepMoji [16]

It has two BiLSTM layers and one attention layer. It assigns each emoji a unique embedding and trains emoji embeddings with texts.

WATT-BiE-LSTM [29]

It assigns two different embeddings to each emoji and adopts an attention mechanism to get the correlation between input words and emojis. Since there may be more than one emoji in a sentence, we adopt average pooling to obtain emoji embeddings. Meanwhile, we replace LSTM with BiLSTM.

MATT-BiE-LSTM [29]

Similar to WATT-BiE-LSTM [29], it assigns two different embeddings to each emoji. First, it adopts an attention mechanism to capture the correlation between input sentences and emojis. Then it uses another attention layer to get the correlation between input words and the outputs of the previous attention layer. We also use average pooling to obtain the emoji embeddings. Furthermore, we replace LSTM with BiLSTM.

BERT [36]

It is a typical transformer-based model for sentence representation. For classification, the last hidden state of [CLS] is fed into a fully connected network that outputs the sentiment label.

RoBERTa [37]

It is a more robust BERT. RoBERTa removes the next sentence prediction task from BERT and uses dynamic MLM to mask tokens.

XLNet [53]

It is an autoregressive language model. It randomly shuffles the order of each word in a sentence and masks a certain amount of tokens at the end of the shuffled sentence. Then it predicts these masked tokens by autoregression.

DistilBERT [54]

It is a smaller and faster BERT. It uses BERT as the teacher and only trains itself on the raw data. DistilBERT can learn the same inner representation of the Chinese or English language as BERT, while it can be faster for inference on downstream tasks.

For these transformer-based models, if we choose to pre-train them, we use the pre-training dataset to pre-train them with MLM; if we choose to fine-tune them, we adopt these models to do sentiment analysis tasks on the fine-tuning dataset and fine-tune all parameters of them; and if we choose not to fine-tune them, we also adopt these models to do sentiment analysis tasks on the fine-tuning dataset and only fine-tune the classifier of these models.

Table 4
Experimental results on the Chinese dataset (%). PT and FT stand for pre-training and fine-tuning, respectively. T and F stand for true and false, respectively. If FT is false, we only fine-tune the classifier. ^* means that we reproduce the model. Underlined values are the best results achieved by the baseline models. Bold values are the best results among all models

Model	PT	FT	Acc	Macro-P	Macro-R	Macro-F1
DeepMoji^*	T	T	60.57	30.28	50.00	37.72
WATT-BiE-LSTM^*	–	T	82.47	82.33	80.57	81.21
MATT-BiE-LSTM^*	–	T	60.50	80.23	50.11	37.90
BERT (base)	F	F	87.11	86.82	86.02	86.37
	F	T	87.63	87.34	86.59	86.93
	T	F	87.80	87.66	86.62	87.06
	T	T	87.88	87.63	86.85	87.19
RoBERTa (base)	F	F	88.23	87.94	87.28	87.58
	F	T	88.45	88.26	87.46	87.81
	T	F	88.66	88.62	87.48	87.96
	T	T	88.80	88.50	87.97	88.22
RoBERTa (large)	F	F	88.66	88.22	87.98	88.09
	F	T	88.98	88.72	88.12	88.39
	T	F	87.98	87.65	87.03	87.31
	T	T	88.72	88.49	87.79	88.10
XLNet (base)	F	F	88.14	87.72	87.36	87.53
	F	T	88.45	88.11	87.65	87.86
XLNet (mid)	F	F	87.20	86.84	86.20	86.49
	F	T	87.41	86.98	86.60	86.78
DistilBERT (base)	F	F	84.97	85.32	82.95	83.78
	F	T	85.07	85.35	83.18	83.95
	T	F	84.97	85.23	83.03	83.81
	T	T	85.33	85.57	83.51	84.26
BEMOJI (ours)	F	F	90.45	90.41	89.53	89.91
	F	T	90.71	90.87	89.63	90.15
	T	F	90.54	90.77	89.37	89.95
	T	T	91.41	91.30	90.66	90.95

5.2.1 Experimental results on the Chinese dataset

On the Chinese dataset, we choose DeepMoji [16], WATT-BiE-LSTM [29], MATT-BiE-LSTM [29], BERT (base) [36], RoBERTa (base) [55], RoBERTa (large) [55], XLNet (base) [55], XLNet (mid) [55] and DistilBERT (base) [56] as our baseline methods.

Table 4 shows the experimental results on the Chinese dataset. From these results, we can observe that: (1) BEMOJI with pre-training and fine-tuning achieves the best accuracy, Macro-P, Macro-R and Macro-F1 scores (91.41%, 91.30%, 90.66% and 90.95%, respectively). (2) The best results achieved by BEMOJI are 2.61%, 2.80%, 2.69% and 2.73% higher than the model with the second highest results (i.e., RoBERTa (base) with pre-training and fine-tuning) on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (3) The lowest scores achieved by BEMOJI across its four settings (90.45%, 90.41%, 89.37% and 89.91%) are still 1.65%, 1.91%, 1.40% and 1.69% higher than those of the model with the second highest results on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (4) BEMOJI with pre-training and fine-tuning scores 3.53%, 3.67%, 3.81% and 3.76% higher than BERT with pre-training and fine-tuning on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (5) The emoji-based models perform markedly worse than the transformer-based models. Specifically, WATT-BiE-LSTM achieves the highest accuracy, Macro-P, Macro-R and Macro-F1 scores (82.47%, 82.33%, 80.57% and 81.21%, respectively) among all emoji-based models. However, the accuracy, Macro-P, Macro-R and Macro-F1 scores of WATT-BiE-LSTM are still 5.41%, 5.30%, 6.28% and 5.98% lower than those of BERT (base) with pre-training and fine-tuning, and 8.94%, 8.97%, 10.09% and 9.74% lower than those of BEMOJI with pre-training and fine-tuning, respectively. (6) Models with pre-training and fine-tuning generally perform better than those without pre-training and fine-tuning, indicating the importance of the two strategies for sentiment analysis models on the Chinese dataset. (7) Compared to all baseline methods, the proposed BEMOJI with/without pre-training and with/without fine-tuning shows the best performance on the Chinese dataset.

In order to better show the performance difference between BEMOJI and other models, Fig. 5 shows the best results achieved by the above eight models on the Chinese dataset.

Figure 5.

The best results on the Chinese dataset.

From these observations, we can conclude that: (1) BEMOJI outperforms the other transformer-based models. Since the backbone of BEMOJI is BERT (base), its gain over BERT (base) in particular indicates that emoji descriptions provide additional sentiment information. (2) Pre-trained language models outperform RNN-based models.

Table 5

Experimental results on the English dataset (%). PT and FT stand for pre-training and fine-tuning, respectively. T and F stand for true and false, respectively. If FT is false (F), we only fine-tune the classifier. ^* means that we reproduce the model. Underlined values are the best results achieved by the baseline models. Bold values are the best results among all models

Model	PT	FT	Acc	Macro-P	Macro-R	Macro-F1
DeepMoji^*	T	T	59.21	29.60	50.00	37.19
WATT-BiE-LSTM^*	–	T	89.34	89.04	88.93	88.98
MATT-BiE-LSTM^*	–	T	73.90	73.36	71.92	72.32
BERT (base)	F	F	87.36	86.96	86.85	86.90
	F	T	86.64	86.07	86.52	86.26
	T	F	88.09	87.71	87.60	87.65
	T	T	88.45	88.12	87.90	88.01
BERT (large)	F	F	85.56	85.68	84.23	84.77
	F	T	87.00	86.49	86.69	86.58
	T	F	84.84	84.22	84.86	84.46
	T	T	84.84	84.24	84.99	84.50
RoBERTa (base)	F	F	88.45	87.93	88.32	88.11
	F	T	87.73	87.17	87.71	87.39
	T	F	87.36	86.79	87.54	87.07
	T	T	88.09	87.53	88.15	87.79
RoBERTa (large)	F	F	86.64	86.21	86.11	86.15
	F	T	89.89	90.01	88.99	89.41
	T	F	89.17	88.72	88.93	88.82
	T	T	88.45	88.04	88.04	88.04
XLNet (base)	F	F	87.73	87.37	87.16	87.26
	F	T	87.73	87.37	87.16	87.26
XLNet (large)	F	F	84.48	84.10	83.59	83.82
	F	T	82.31	81.76	81.48	81.61
DistilBERT (base)	F	F	87.00	86.49	86.69	86.58
	F	T	85.92	85.33	85.77	85.52
	T	F	87.36	87.26	86.44	86.79
	T	T	88.45	87.98	88.18	88.07
BEMOJI (ours)	F	F	92.58	92.00	92.84	92.34
	F	T	92.97	92.36	93.49	92.77
	T	F	93.36	92.85	93.49	93.13
	T	T	93.36	92.79	93.65	93.15

5.2.2 Experimental results on the English dataset

On the English dataset, we also choose DeepMoji [16], WATT-BiE-LSTM [29], MATT-BiE-LSTM [29], BERT (base) [36], BERT (large) [36], RoBERTa (base) [37], RoBERTa (large) [37], XLNet (base) [53], XLNet (large) [53] and DistilBERT (base) [54] as our baseline methods.

Table 5 shows the experimental results on the English dataset. From these results, we can observe that: (1) Compared to all other models, DeepMoji with pre-training and fine-tuning shows relatively low performance across all metrics. (2) WATT-BiE-LSTM with fine-tuning achieves higher scores than DeepMoji with pre-training and fine-tuning. (3) MATT-BiE-LSTM with fine-tuning performs worse than WATT-BiE-LSTM with fine-tuning. (4) Without pre-training and with fine-tuning, BERT (base) performs worse than BERT (large). However, in the other three settings, BERT (base) outperforms BERT (large). (5) Without pre-training and fine-tuning, RoBERTa (base) outperforms RoBERTa (large). However, in the other three settings, RoBERTa (base) generally performs worse than RoBERTa (large). (6) For XLNet (base) [53] and XLNet (large), both without pre-training and fine-tuning and without pre-training and with fine-tuning, the base version outperforms the large version. (7) DistilBERT (base) with pre-training and fine-tuning performs better than DistilBERT (base) in the other three settings. (8) BEMOJI with pre-training and fine-tuning achieves the best results on the accuracy, Macro-R and Macro-F1 scores (93.36%, 93.65% and 93.15%, respectively), while BEMOJI with pre-training and without fine-tuning achieves the best Macro-P score (92.85%). (9) The best results achieved by BEMOJI are 3.47%, 2.84%, 4.66% and 3.74% higher than the model with the second highest results (i.e., RoBERTa (large) without pre-training and with fine-tuning) on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (10) The lowest scores achieved by BEMOJI across its four settings (92.58%, 92.00%, 92.84% and 92.34%) are still 2.69%, 1.99%, 3.85% and 2.93% higher than those of the model with the second highest results on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (11) BEMOJI with pre-training and fine-tuning scores 4.91%, 4.67%, 5.75% and 5.14% higher than BERT with pre-training and fine-tuning on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (12) WATT-BiE-LSTM achieves the highest accuracy (89.34%), Macro-P (89.04%), Macro-R (88.93%) and Macro-F1 (88.98%) among all emoji-based models. Meanwhile, WATT-BiE-LSTM achieves results comparable to the transformer-based models: it is only 0.55%, 0.97%, 0.06% and 0.43% lower than RoBERTa (large) without pre-training and with fine-tuning on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. However, BEMOJI with pre-training and fine-tuning is still 4.02%, 3.75%, 4.72% and 4.17% higher than WATT-BiE-LSTM on the accuracy, Macro-P, Macro-R and Macro-F1 scores, respectively. (13) Models that utilize both pre-training and fine-tuning generally perform better than those without pre-training and fine-tuning, indicating the importance of the two strategies for sentiment analysis models on the English dataset. (14) Compared to all baselines, the proposed BEMOJI with/without pre-training and with/without fine-tuning shows the best performance on the English dataset.

In order to better show the performance difference between BEMOJI and other models, Fig. 6 shows the best results achieved by the above nine models on the English dataset.

Figure 6.

The best results on the English dataset.

From these observations, we can conclude that: (1) The pre-training and fine-tuning strategies generally improve the performance of sentiment analysis models on the Chinese and English datasets, implying that the two strategies can make good use of text and emoji information in texts with emojis. (2) BEMOJI also outperforms other models on the English dataset, which means emoji descriptions can provide sentiment information. (3) Compared to all baseline methods, BEMOJI obtains the best performance on Chinese and English datasets.

5.3 Ablation study

In this subsection, we explore the effects of the emoji descriptions and the fusion layer on BEMOJI. "w/o emojis" denotes fine-tuning BEMOJI without EBERT, so that the model contains only CBERT, the text half of the fusion layer as given by Eq. (23), and the classifier. "w/o fusion" denotes replacing the fusion layer of BEMOJI with a concatenation operation, as given by Eq. (24).

$\displaystyle\bm{h}^{f}=\sigma(\bm{W}^{t}\bm{h}^{t}+\bm{b}^{f}).$ (23)

$\displaystyle\bm{h}^{f}={\rm concat}(\bm{h}^{t},\bm{h}^{e}).$ (24)
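
To make the two ablation variants concrete, the sketch below evaluates Eq. (23) and Eq. (24) on plain Python lists. It assumes that $\sigma$ is the logistic sigmoid applied element-wise, which this section does not state; the function names and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid (assumed meaning of sigma in Eq. (23))."""
    return 1.0 / (1.0 + math.exp(-x))


def half_fusion(w_t, h_t, b_f):
    """Eq. (23): h_f = sigma(W_t h_t + b_f), using the text representation only."""
    return [
        sigmoid(sum(w * h for w, h in zip(row, h_t)) + b)
        for row, b in zip(w_t, b_f)
    ]


def concat_fusion(h_t, h_e):
    """Eq. (24): h_f = concat(h_t, h_e), with no learned fusion parameters."""
    return list(h_t) + list(h_e)


h_t = [1.0, 2.0]                   # toy text representation
h_e = [3.0]                        # toy emoji description representation
w_t = [[0.0, 0.0], [0.0, 0.0]]     # toy weight matrix
b_f = [0.0, 0.0]                   # toy bias
without_emojis = half_fusion(w_t, h_t, b_f)
without_fusion = concat_fusion(h_t, h_e)
```

The first variant never sees $\bm{h}^{e}$, while the second keeps both representations but simply places them side by side, so the output dimension is the sum of the two input dimensions.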

Table 6 shows the results of the ablation study. From these results, we can observe that: (1) Using emojis in the pre-training and fine-tuning stages helps BEMOJI classify sentiment more accurately on the emoji sentiment analysis task. (2) Without emoji descriptions, the accuracy of BEMOJI drops by 3.30% on the Chinese dataset and 6.64% on the English dataset, and the Macro-F1 score drops by 3.55% and 6.92%, respectively. (3) Without the fusion layer, the accuracy of BEMOJI drops by 0.61% on the Chinese dataset and 1.17% on the English dataset, and the Macro-F1 score drops by 0.68% and 1.18%, respectively. (4) BEMOJI without emoji descriptions is 0.23% higher than BERT on the accuracy score and 0.21% higher than BERT on the Macro-F1 score on the Chinese dataset, while it is 1.73% lower than BERT on the accuracy score and 1.78% lower than BERT on the Macro-F1 score on the English dataset. (5) BEMOJI without the fusion layer is 2.92% higher than BERT on the accuracy score and 3.08% higher than BERT on the Macro-F1 score on the Chinese dataset, and it is 3.74% higher than BERT on the accuracy score and 3.96% higher than BERT on the Macro-F1 score on the English dataset.

From these observations, we can conclude that: (1) Emoji descriptions and the fusion layer can help BEMOJI to achieve better performance. (2) Since the backbone of BEMOJI is BERT (base), the performance of BEMOJI without emoji descriptions is similar to BERT (base), which means emoji descriptions provide a lot of sentiment information.

Table 6

Ablation study on the Chinese and English datasets (%). Experimental results are output from the pre-trained and fine-tuned models

	Chinese		English
Model	Acc	Macro-F1	Acc	Macro-F1
BEMOJI	91.41	90.95	93.36	93.15
w/o emojis	88.11	87.40	86.72	86.23
w/o fusion	90.80	90.27	92.19	91.97
BERT (base)	87.88	87.19	88.45	88.01
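
The drops quoted in the ablation discussion follow directly from Table 6. As a quick arithmetic check, the sketch below recomputes them from the table values; the dictionary layout and key names are illustrative choices only.

```python
# Values copied from Table 6 (%).
TABLE_6 = {
    "BEMOJI":      {"zh_acc": 91.41, "zh_f1": 90.95, "en_acc": 93.36, "en_f1": 93.15},
    "w/o emojis":  {"zh_acc": 88.11, "zh_f1": 87.40, "en_acc": 86.72, "en_f1": 86.23},
    "w/o fusion":  {"zh_acc": 90.80, "zh_f1": 90.27, "en_acc": 92.19, "en_f1": 91.97},
    "BERT (base)": {"zh_acc": 87.88, "zh_f1": 87.19, "en_acc": 88.45, "en_f1": 88.01},
}


def gap(model_a, model_b):
    """Score of model_a minus score of model_b, rounded to two decimals."""
    return {
        key: round(TABLE_6[model_a][key] - TABLE_6[model_b][key], 2)
        for key in TABLE_6[model_a]
    }


drop_without_emojis = gap("BEMOJI", "w/o emojis")
drop_without_fusion = gap("BEMOJI", "w/o fusion")
fusion_only_vs_bert = gap("w/o fusion", "BERT (base)")
```

Removing the emoji descriptions costs several points on both datasets, whereas replacing the fusion layer with concatenation costs roughly one point or less, which matches the relative importance described in the observations.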

5.4 Robustness study

We adopt four sentiment analysis datasets without emojis (two Chinese datasets and two English datasets) to illustrate the robustness of BEMOJI: SST-2 [57], SemEval-2013 Task 2 [58], NLPCC-2014 Task 1$^{7}$ and SMP2019-ECISA.$^{8}$ For SemEval-2013 Task 2, we only consider positive, negative and neutral sentiments; for NLPCC-2014 Task 1, we only classify whether the text carries sentiment. The details of these four datasets are shown in Table 7.

Table 7
The details of robustness study datasets

		Train	Dev	Test
SST-2	Positive	37569	444	912
	Negative	29780	428	909
SemEval-2013	Positive	2974	483	1281
	Negative	1159	280	472
	Neutral	1446	618	1385
SMP2019-ECISA	Positive	3799	1229	919
	Negative	3934	1353	979
	Neutral	6956	2550	1902
NLPCC2014	Positive	23784	5947	11870
	Negative	12552	3138	3823

We compare BEMOJI and BERT (base) on these four datasets. For BEMOJI, we only use the pre-trained ${\rm CBERT}$ and a randomly initialized classifier. For BERT, we adopt the pre-trained parameters from Hugging Face as the starting point and randomly initialize a classifier. Both models are then trained on each of these four datasets, with all parameters fine-tuned.

The robustness study results are shown in Table 8. From these results, we can observe that: (1) BERT (base) achieves a higher accuracy score on SST-2 (93.5%) and a higher Macro-F1 score on SemEval-2013 and SMP2019-ECISA (73.27% and 77.89%), while BEMOJI achieves a higher accuracy score on SemEval-2013 and NLPCC2014 (73.93% and 80.30%) and a higher Macro-F1 score on NLPCC2014 (76.12%). (2) On SST-2, BEMOJI is 2.3% lower than BERT on the accuracy score. On SemEval-2013, BEMOJI is 0.19% higher than BERT on the accuracy score and 0.62% lower than BERT on the Macro-F1 score. On SMP2019-ECISA, BEMOJI is 0.34% lower than BERT on the Macro-F1 score. On NLPCC2014, BEMOJI is 1.26% higher than BERT on the accuracy score and 0.81% higher than BERT on the Macro-F1 score.

From these observations, we can conclude that BEMOJI is robust and achieves results comparable to BERT on non-emoji sentiment analysis tasks.

Table 8

Robustness study on non-emoji sentiment classification tasks (%)

Dataset	Metric	BEMOJI	BERT
SST-2	ACC	91.2	93.5
SemEval-2013	ACC	73.93	73.74
	Macro-F1	72.65	73.27
SMP2019-ECISA	Macro-F1	77.55	77.89
NLPCC2014	ACC	80.30	79.04
	Macro-F1	76.12	75.31

5.5 Case study

In this subsection, we select some Chinese and English quotes as inputs to BEMOJI and BERT in order to analyze the resulting emojis predicted by these two models. This experiment aims to evaluate the models’ ability to accurately represent the sentiments conveyed in the input texts through the resulting emojis.

Figure 7.

The experimental results of case studies. The value inside the parentheses refers to the cosine similarity between the sentence embedding of an input text and an emoji description. And the average value refers to the average similarity between the sentence embedding of the top 5 emoji descriptions and the input sentence.

In this study, we utilize two well pre-trained models, including BEMOJI and BERT, to learn the correlation between input text $\{w_{1},w_{2},\ldots,w_{n}\}$ and emoji description $e$ , and predict the top 5 emojis associated with a selection of quotes. Besides, to ensure an impartial and objective assessment of the experimental outcomes, we incorporate Word2Vec to represent the input sentences and the predicted emoji descriptions. Then we calculate the cosine similarities between the sentence embedding and emoji description embeddings. Specifically, we utilize Word2Vec embeddings for each word in the input sentences and emoji descriptions, and adopt average pooling to obtain the sentence embeddings. Figure 7 shows the experimental results of case studies.
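
The similarity computation described above can be sketched as follows. The two-dimensional vectors are hypothetical stand-ins for Word2Vec embeddings, and the function names are illustrative; skipping out-of-vocabulary tokens is also an assumption.

```python
import math


def average_pool(tokens, embeddings):
    """Average the embeddings of the tokens found in the vocabulary."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 2-d vectors standing in for 300-d Word2Vec embeddings.
toy_embeddings = {
    "great": [1.0, 0.0],
    "day": [0.0, 1.0],
    "happy": [1.0, 1.0],
    "face": [0.0, 1.0],
}
sentence_vec = average_pool(["great", "day"], toy_embeddings)
description_vec = average_pool(["happy", "face"], toy_embeddings)
similarity = cosine_similarity(sentence_vec, description_vec)
```

Because the same external embeddings score the predictions of both models, the comparison does not depend on the internal representations of either BEMOJI or BERT.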

Based on these results, we can observe that: (1) BEMOJI outputs more appropriate emojis than BERT. (2) The cosine similarity between the input sentence embedding and the embedding of an emoji description predicted by BEMOJI is higher than that obtained with BERT. From these observations, we can conclude that learning the correlations between texts and emoji descriptions during the pre-training stage enhances the ability of the sentiment analysis model to learn the association between texts and emojis.

6. Discussion

Experimental results show that the proposed BEMOJI improves the performance of sentiment analysis by using prompt learning on both Chinese and English datasets. Tables 4 and 5 show that BEMOJI outperforms emoji- and transformer-based models on the Chinese and English datasets. BEMOJI achieves better results than the other transformer-based models on the sentiment analysis of sentences with emojis, mainly because it considers emoji information at both the pre-training and fine-tuning stages. BEMOJI also achieves higher results than the emoji-based models because it adopts a powerful pre-trained language model as its backbone, and pre-trained language models represent sentences better than RNN-based models. In addition, DeepMoji [16], WATT-BiE-LSTM [29] and MATT-BiE-LSTM [29] only treat emojis as unique tokens and represent them with separate embeddings, which ignores the sentiment information inside emojis. Furthermore, BEMOJI adopts prompt learning to force the model to learn the correlations between input sentences and the corresponding emojis, which enables BEMOJI to better learn the sentiment information inside emojis.

Table 6 shows the ablation study results. From these results, we can conclude that emoji information helps BEMOJI to achieve better results. Without emojis, BEMOJI obtains comparable results with BERT (base). That is because the backbone of BEMOJI is BERT (base). Meanwhile, the proposed fusion layer is essential for BEMOJI because it fuses two different representations into the same vector space.

Table 8 shows the robustness study results. From these results, we can conclude that BEMOJI is a robust model. When we remove emoji information, BEMOJI can still finish the sentiment analysis tasks.

Figure 7 shows the case study results. From these results, we can find that a BERT model pre-trained with texts and emoji descriptions captures the associations between them well.

Based on the discussion above, the proposed BEMOJI model achieves the best performance compared with the existing models.

7. Conclusion

This paper proposed BEMOJI, a model that incorporates emoji sentiment information into language representation models. Specifically, we used prompt learning to pre-train BEMOJI to learn the relationship between emoji descriptions and texts. When we fine-tuned BEMOJI on downstream tasks, we fused emoji description and text representations so that BEMOJI could better understand text sentiment. The experimental results illustrated that BEMOJI analyzed the sentiment of texts with emojis better than the baseline methods. The ablation study illustrated that both the emoji descriptions and the fusion layer contributed to the performance of BEMOJI. BEMOJI also showed strong robustness in analyzing the sentiment of texts without emojis. The case study showed that BEMOJI can produce more appropriate emojis than BERT.

However, some issues remain for future research: (1) Pre-trained language representation models are fragile [59], and a slight change in the input can affect the emojis output by the model. An open problem is how to make the model more robust so that it outputs more reasonable emojis. (2) Many texts and emojis are used sarcastically. How to identify such sarcastic texts and emojis is another open problem. (3) More real-world corpora should be collected to enrich the pre-training data.

Data availability

The data used to support the findings of this study are available from the corresponding author upon request. All source codes are available at https://github.com/Balding-Lee/BEMOJI.

Authors’ contributions

The authors claim that the research was realized in collaboration with the same responsibility. All authors read and approved the last version of the manuscript.

Footnotes

1. Since the number of emojis is small, we fix all the parameters of ${\rm EBERT}$ except the Pooler layer.

2. https://huggingface.co/bert-base-uncased.

3. https://huggingface.co/bert-base-chinese.

4. https://www.emojiall.com/zh-hans/platform-weibo.

5. https://www.emojiall.com/zh-hans.

6. https://emojipedia.org/.

7. http://tcci.ccf.org.cn/conference/2014/pages/page04_eva.html.

8. http://sa-nsfc.com/evaluation/ecisa/dataset.

Acknowledgments

The authors would like to appreciate the editors’ and anonymous reviewers’ helpful comments and constructive suggestions, which have improved the quality of this manuscript. This work is partially supported by the Sichuan Science and Technology Program (Nos. 2022YFG0378, 2023YFS0424, 2023YFH0058 and 2023YFQ0044), Yibin Science and Technology Program (No. 2023SF004), Engineering Research Center for ICH Digitalization and Multi-source Information Fusion (Fujian Polytechnic Normal University), Fujian Province University (No. G3-KF2022), and Innovation Fund of Postgraduate, Xihua University (Grant No. YCJJ2021025, YCJJ2021031 and YCJJ2021124).

Conflict of interest

All authors declare no conflicts of interest.

References

Liao

Wang

Chen

Wang

and Zhang

, Dynamic commonsense knowledge fused method for Chinese implicit sentiment analysis, Information Processing and Management 59(3) (2022), 102934.

Liu

, Sentiment Analysis and Opinion Mining, in: Synthesis Lectures on Human Language Technologies 5.1 (2012), 2011, pp. 1–167.

Qian

Mathur

Zakaria

N.H.

Arora

Gupta

and Ali

, Understanding public opinions on social media for financial sentiment analysis using AI-based techniques, Information Processing and Management 59(6) (2022), 103098.

Jain

D.K.

Boyapati

Venkatesh

and Prakash

, An intelligent cognitive-inspired computing with big data analytics framework for sentiment analysis and classification, Information Processing and Management 59(1) (2022), 102758.

Wei

Liao

Yang

Wang

and Zhao

, BiLSTM with multi-polarity orthogonal attention for implicit sentiment analysis, Neurocomputing 383 (2020), 165–173.

Halim

Atif

Rashid

and Edwin

C.A.

, Profiling players using real-world datasets: Clustering the data and correlating the results with the big-five personality traits, IEEE Transactions on Affective Computing 10(4) (2019), 568–584.

Halim

and Zouq

, On identification of big-five personality traits through choice of images in a real-world setting, Multimedia Tools and Applications 80(24) (2021), 33377–33408.

Tahir

Tubaishat

Al-Obeidat

Shah

Halim

and Waqas

, A novel binary chaotic genetic algorithm for feature selection and its utility in affective computing and healthcare, Neural Computing and Applications 34(14) (2022), 11453–11474.

Ghosh

Ekbal

and Bhattacharyya

, A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes, Cognitive Computation 14(1) (2022), 110–129.

10.

Halim

and Rehan

, On identification of driving-induced stress using electroencephalogram signals: A framework based on wearable safety-critical scheme and machine learning, Information Fusion 53 (2020), 66–79.

11.

Deng

and Wiebe

, Sentiment Propagation via Implicature Constraints, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, The Association for Computer Linguistics, 2014, pp. 377–385.

12.

Liao

Wang

and Li

, Identification of fact-implied implicit sentiment based on multi-level semantic fused representation, Knowledge Based Systems 165 (2019), 197–207.

13.

and Chen

, ISWR: An Implicit Sentiment Words Recognition Model Based on Sentiment Propagation, in: Natural Language Processing and Chinese Computing – 10th CCF International Conference, NLPCC 2021, Vol. 13029, Springer, 2021, pp. 248–259.

14.

Poria

Hazarika

Majumder

and Mihalcea

, Beneath the Tip of the Iceberg: Current Challenges and New Directions in Sentiment Analysis Research, 2020.

15.

Zhuang

Liu

Hung

and Chai

, Implicit sentiment analysis based on multi-feature neural network model, Soft Computing 26(2) (2022), 635–644.

16.

Felbo

Mislove

Søgaard

Rahwan

and Lehmann

, Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017 Palmer

Hwa

and Riedel

, eds, 2017, pp. 1615–1625.

17.

Chen

Cao

Mei

and Liu

, SEntiMoji: an emoji-powered learning approach for sentiment analysis in software engineering, in: Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, ACM, 2019, pp. 841–852.

18.

Gavilanes

M.F.

Juncal-Martínez

García-Méndez

Costa-Montenegro

and González-Castaño

F.J.

, Creating emoji lexica from unsupervised sentiment analysis of their descriptions, Expert Systems with Applications 103 (2018), 74–91.

19.

Gavilanes

M.F.

Costa-Montenegro

García-Méndez

González-Castaño

F.J.

and Juncal-Martínez

, Evaluation of online emoji description resources for sentiment analysis purposes, Expert Systems with Applications 184 (2021), 115279.

20.

Wang

Zhou

Sun

Gui

Zhang

and Huang

, Causal Intervention Improves Implicit Sentiment Analysis, in: Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, International Committee on Computational Linguistics, 2022, pp. 6966–6977.

21. Yin and Shang, ContextBert: Enhanced Implicit Sentiment Analysis Using Implicit-sentiment-query Attention, in: International Joint Conference on Neural Networks, IJCNN 2022, IEEE, 2022, pp. 1–8.

22. Zhou, Wang, Zhang and He, Implicit Sentiment Analysis with Event-centered Text Representation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Association for Computational Linguistics, 2021, pp. 6884–6893.

23. Bordes, Usunier, García-Durán, Weston and Yakhnenko, Translating embeddings for modeling multi-relational data, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, NIPS 2013, 2013, pp. 2787–2795.

24. Velickovic, Cucurull, Casanova, Romero, Liò and Bengio, Graph Attention Networks, in: 6th International Conference on Learning Representations, ICLR 2018, OpenReview.net, 2018, pp. 1–12.

25. Wang, Feng, Yang and Zhang, KC-ISA: An Implicit Sentiment Analysis Model Combining Knowledge Enhancement and Context Features, in: Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, International Committee on Computational Linguistics, 2022, pp. 6906–6915.

26. Zhong, Lin and He, Dynamic multi-scale topological representation for enhancing network intrusion detection, Computers & Security 135 (2023), 103516.

27. Liu, Zheng and Lin, Cross-domain sentiment aware word embeddings for review sentiment analysis, International Journal of Machine Learning and Cybernetics 12(2) (2021), 343–354.

28. Rzepka, Ptaszynski and Araki, Emoji-Aware Attention-based Bi-directional GRU Network Model for Chinese Sentiment Analysis, in: Joint Proceedings of the Workshops on Linguistic and Cognitive Approaches to Dialog Agents (LaCATODA 2019) and on Bridging the Gap Between Human and Automated Reasoning (BtG 2019) co-located with 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), CEUR Workshop Proceedings, Vol. 2452, CEUR-WS.org, 2019, pp. 11–18.

29. Chen, Yuan, You and Luo, Twitter Sentiment Analysis via Bi-sense Emoji Embedding and Attention-based LSTM, in: 2018 ACM Multimedia Conference on Multimedia Conference, MM 2018, ACM, 2018, pp. 117–125.

30. Chen, Lin, Polat, Alhudhaif and Alenezi, Consistency- and dependence-guided knowledge distillation for object detection in remote sensing images, Expert Systems with Applications 229(Part A) (2023), 120519.

31. Chen, Lin, Liu, Yang, Zhang and Xu, NT-DPTC: A non-negative temporal dimension preserved tensor completion model for missing traffic data imputation, Information Sciences 653 (2024), 119797.

32. Al-Halah, Aitken A.P., Shi and Caballero, Smile, Be Happy: Emoji Embedding for Visual Sentiment Analysis, in: 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, IEEE, 2019, pp. 4491–4500.

33. You, Luo, Jin and Yang, Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI 2015, AAAI Press, 2015, pp. 381–388.

34. Laurenceau, Louis J.D. and Gilbert J.E., Examining Bias in Sentiment Analysis Algorithms Interacting with Emojis with Skin Tone Modifiers, in: HCI International 2022 Posters – 24th International Conference on Human-Computer Interaction, HCII 2022, Communications in Computer and Information Science, Vol. 1582, Springer, 2022, pp. 566–573.

35. Yuan, Zhang and Lv, Pay attention to emoji: Feature Fusion Network with EmoGraph2vec Model for Sentiment Analysis, in: 26th International Conference on Pattern Recognition, ICPR 2022, IEEE, 2022, pp. 1529–1535.

36. Devlin, Chang, Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Association for Computational Linguistics, 2019, pp. 4171–4186.

37. Liu, Ott, Goyal, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019.

38. Liu, Yuan, Jiang, Hayashi and Neubig, Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing, 2021.

39. Petroni, Rocktäschel, Riedel, Lewis P.S.H., Bakhtin and Miller A.H., Language Models as Knowledge Bases, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 2463–2473.

40. Qian, Liu, Ding, Qiu, Yang and Tang, All NLP Tasks Are Generation Tasks: A General Pretraining Framework, 2021.

41. Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy and Bowman S.R., SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 2019, pp. 3261–3275.

42. Schick and Schütze, Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Association for Computational Linguistics, 2021, pp. 255–269.

43. Jiang, Huang, Zhang, Wang, Zhuang, Wei, Huang, Zhang and Zhang, PromptBERT: Improving BERT Sentence Embeddings with Prompts, 2022.

44. Sun, Zheng, Hao and Qiu, NSP-BERT: A Prompt-based Few-Shot Learner through an Original Pre-training Task – Next Sentence Prediction, in: Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, International Committee on Computational Linguistics, 2022, pp. 3233–3250.

45. Xiang, Wang, Dai and Wang, ConnPrompt: Connective-cloze Prompt Learning for Implicit Discourse Relation Recognition, in: Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, International Committee on Computational Linguistics, 2022, pp. 902–911.

46. Grill, Strub, Altché, Tallec, Richemond P.H., Buchatskaya, Doersch, Pires B.Á., Guo, Azar M.G., Piot, Kavukcuoglu, Munos and Valko, Bootstrap Your Own Latent – A New Approach to Self-Supervised Learning, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020, pp. 21271–21284.

47. Groenendijk, Karaoglu, Gevers and Mensink, Multi-Loss Weighting with Coefficient of Variations, in: IEEE Winter Conference on Applications of Computer Vision, WACV 2021, IEEE, 2021, pp. 1468–1477.

48. Zhang, Han, Liu, Jiang, Sun and Liu, ERNIE: Enhanced Language Representation with Informative Entities, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Association for Computational Linguistics, 2019, pp. 1441–1451.

49. Wang, Gao, Zhu, Zhang, Liu and Tang, KEPLER: A unified model for knowledge embedding and pre-trained language representation, Transactions of the Association for Computational Linguistics 9 (2021), 176–194.

50. Hendrycks and Gimpel, Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units, 2016.

51. Shao, Geng, Liu, Dai, Yang, Zhe, Bao and Qiu, CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation, 2021.

52. Loshchilov and Hutter, Decoupled Weight Decay Regularization, in: 7th International Conference on Learning Representations, ICLR 2019, OpenReview.net, 2019, pp. 1–8.

53. Yang, Dai, Yang, Carbonell J.G., Salakhutdinov and Le Q.V., XLNet: Generalized Autoregressive Pretraining for Language Understanding, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 2019, pp. 5754–5764.

54. Sanh, Debut, Chaumond and Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019.

55. Cui, Che, Liu, Qin, Wang and Hu, Revisiting Pre-Trained Models for Chinese Natural Language Processing, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Findings of ACL, Association for Computational Linguistics, 2020, pp. 657–668.

56. Abdaoui, Pradel and Sigel, Load What You Need: Smaller Versions of Mutlilingual BERT, in: SustaiNLP/EMNLP, 2020, pp. 119–123.

57. Socher, Perelygin, Chuang, Manning C.D., Ng A.Y. and Potts, Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, Association for Computational Linguistics, 2013, pp. 1631–1642.

58. Nakov, Rosenthal, Kozareva, Stoyanov, Ritter and Wilson, SemEval-2013 Task 2: Sentiment Analysis in Twitter, in: Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Association for Computational Linguistics, 2013, pp. 312–320.

59. Lin B.Y., Lee, Khanna and Ren, Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Association for Computational Linguistics, 2020, pp. 6862–6868.

60. Zhang, Ren and Sun, Deep Residual Learning for Image Recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, 2016, pp. 770–778.

61. Ba L.J., Kiros J.R. and Hinton G.E., Layer Normalization, CoRR, 2016.

62. Glorot, Bordes and Bengio, Deep Sparse Rectifier Neural Networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, JMLR Proceedings, Vol. 15, JMLR.org, 2011, pp. 315–323.

63. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, NIPS 2017, 2017, pp. 5998–6008.

64. Lin, Luo and Xu, HRST-LR: A hessian regularization spatio-temporal low rank algorithm for traffic data imputation, IEEE Transactions on Intelligent Transportation Systems 24(10) (2023), 11001–11017.

