Topic-BERT: Detecting harmful information from social media

Abstract

Harmful information identification is a critical research topic in natural language processing. Existing approaches have focused either on rule-based methods or on harmful text identification in documents of normal length. In this paper, we propose a BERT-based model, called Topic-BERT, to identify harmful information from social media. Firstly, Topic-BERT utilizes BERT to take additional information as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture hidden topics of short texts for attention weight calculation. Secondly, the proposed model divides harmful short text identification into two stages, and labels of different granularities are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model can significantly improve the classification performance compared with baseline methods.

Keywords

Harmful short text, text classification, BERT, topic model, GPU-DMM

1. Introduction

With the rapid development of the mobile Internet, many mobile applications encourage users to comment freely, making these applications more attractive to users. Netizens can interact through social media such as forums, Twitter or Facebook, and they can also post comments on news sites. On the Internet, any user can hide personal information, such as name, occupation and home address. Therefore, some netizens exploit the anonymity and high efficiency of the Internet to spread a lot of harmful information, such as pornography, advertisements and violence [1]. Such harmful information spreads quickly to every corner of the Internet, which not only degrades the community environment, but also makes it harder for netizens to obtain useful information. Detecting harmful information is therefore of great significance for improving user experience and information retrieval efficiency.

The rule-based harmful text identification method can identify whether the text contains sensitive information. However, this kind of method struggles to understand the semantics of the text, and therefore cannot recognize harmful words after they have been deformed. Harmful text identification technology based on content understanding can handle standard-length documents such as news and blogs. Nevertheless, short texts on social media, such as news headlines and tweets, are brief and their language is not standardized. These features make this kind of method less effective for short texts.

Recently, the Bidirectional Encoder Representations from Transformers (BERT) model has achieved great success in the field of natural language processing [2, 3]. BERT is a bidirectional variant of the Transformer model [4] that is trained to classify whether two sentences are consecutive and to predict a masked word from its context. The trained BERT model can be fine-tuned for downstream natural language processing tasks such as recommender systems [5], lexical substitution [6] and sentiment analysis [7]. On eleven tasks, BERT significantly improved over previous state-of-the-art models [2]. This result shows that BERT can learn structural and semantic information about language.

Inspired by this, we propose a novel model for harmful short text identification to address the above challenges, named Topic-based Bidirectional Encoder Representations from Transformers (Topic-BERT). The main idea of Topic-BERT comes from the answers to the following two questions: (1) How to learn more robust short text representations to extract text features suitable for harmful short text identification? (2) How to extend short texts with additional information to alleviate the data sparsity problem in short text classification [8]?

Specifically, after encoding the short text as part of the feature, our model transforms BERT by making full use of all hidden states as our main encoder. In addition to the short text content itself, there are several types of additional information, such as the number of likes and retweets, which can be modeled as features. Topic-BERT exploits part of the additional information as input, and calculates the attention weight using topic information. Unlike other studies, we regard harmful short text identification as a fine-grained multiple-classification task. In aspect-based fine-grained sentiment analysis, researchers generally classify aspects first, and then perform sentiment analysis according to the classified aspects. Inspired by this, we adopt a two-stage model to distinguish labels of different granularities for multiple-classification. The major contributions of this paper are summarized as follows:

1. We propose a novel model to identify harmful information from short texts. The model employs BERT to take part of the additional information as input to alleviate the sparseness of short texts, and uses GPU-DMM (Generalized Pólya Urn – Dirichlet Multinomial Mixture) [9] to learn the topic information of short texts for attention weight calculation. To the best of our knowledge, this is the first work to integrate topic knowledge based on GPU-DMM into BERT.
2. Our model divides the recognition of harmful short texts into two stages, distinguishing labels of different granularities at different stages. Furthermore, unlike the BERT model, which uses only the first hidden state as a sequence representation, Topic-BERT uses all hidden states and applies a topic feature attention mechanism to calculate weights.
3. The performance of Topic-BERT is evaluated on a real-world short text corpus against three baseline models. Experimental results on binary-classification and multiple-classification tasks show that the proposed model outperforms the baselines in precision, recall, and F1 measure.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of related work. The details of the proposed model are presented in Section 3. We describe the datasets and experimental results in Section 4, and Section 5 concludes the paper.

2. Related work

Harmful short text identification can be regarded as a binary-classification task, that is, short texts are classified as harmful short texts or normal short texts. Unlike lengthy documents such as news and academic papers, short texts are more ambiguous because they do not have enough contextual information, which brings great challenges for classification [10, 11, 12].

One type of research extends text features by increasing the topic information of short texts to enhance classification accuracy. Chen et al. proposed a method for short text classification using K-nearest neighbor clustering algorithm and the Latent Dirichlet Allocation (LDA) [13] topic model [14]. This approach leverages topic features to tackle the data sparseness problem and increases the semantic information of short text. Boom et al. proposed a hybrid method that combines word vectors, tf-idf similarity and dense distributions to reduce the impact of sparse terms on short text classification [15]. Rao et al. proposed a method for social emotion classification of short texts based on a Topic-level Maximum Entropy (TME) model. TME utilizes reader ratings, emotion labels and topic modeling to generate topic-level short text features. However, the texts on social media usually cover a wide range of topics, with non-standard terminology and little co-occurrence information. Therefore, it is difficult for traditional topic models to extract high-quality topics from short texts, resulting in low accuracy of harmful short text identification. By contrast, our method exploits GPU-DMM to extract high-quality topic features of short texts.

Recently, with the success of deep learning algorithms in natural language processing, many studies have focused on using word embeddings to expand the features of short texts, thereby solving the problems of sparse data and insufficient information in short texts [16]. Based on convolutional neural network [17] and word embedding clustering, Wang et al. proposed a unified framework to expand the features of short texts [18]. This method first generates semantic cliques by a clustering algorithm, and then calculates the semantic unit in the short text through the combination of word embeddings and context. Lee et al. proposed a new short text classification model by combining convolutional neural network and recurrent neural network [19]. Zhang et al. proposed a new Cluster-Gated Convolutional Neural Network model (CGCNN) that simultaneously classifies words and short texts in an end-to-end manner [20]. CGCNN first learns word representations using a bi-directional long short-term memory. Then, it uses a soft clustering method to extract the semantic relationship between short texts and cluster centers.

BERT first obtains a pre-trained model by training on a large-scale unlabeled corpus, and then fine-tunes the pre-trained model and applies it to other tasks such as text classification, reading comprehension and relation extraction [2]. Zhang et al. introduced BERT to encode the input sequence into context representations and applied it to text generation tasks [21]. Through exhaustive experiments, Sun et al. investigated different fine-tuning methods of the BERT model in text classification tasks, and provided a general solution for the fine-tuning process of BERT [22]. Their model obtained new state-of-the-art results on eight text classification datasets. Zhang et al. proposed a BERT-based model for a factuality analysis and classification task [23]. The model feeds the representations of an event and its sentence into an output layer for text classification. However, the above strategies neglect the fact that related additional information may help to classify short texts.

3. Methodology

The architecture of Topic-BERT, a two-stage model based on BERT, is shown in Fig. 1. In this section, we first introduce the representation of short texts, and then illustrate the role of additional information and how to use it. Finally, the two phases of our model and their connections are explained separately.

Figure 1.

The architecture of Topic-BERT.

3.1 The representation of short texts

In the Topic-BERT model, BERT is used as the encoder to represent short texts. Without convolutional neural networks and recurrent neural networks, BERT is based on a multi-layer Transformer, which has proven to be a more powerful feature extractor [24]. With the help of large-scale corpora pre-training, BERT captures rich external knowledge, and learns the semantic information of words to help us train Topic-BERT. Based on the BERT model, Topic-BERT is fine-tuned by using harmful short text datasets, which alleviates the sparsity problem of short texts.

Let $S=(\textit{Token}_{1},\textit{Token}_{2},\ldots,\textit{Token}_{i},\ldots,\textit{Token}_{N})$ be a short text, where $\textit{Token}_{i}$ is the $i^{\text{th}}$ token in the short text, and $N$ is the length of the short text. Each token is represented by position embeddings, segment embeddings and token embeddings. Segment embeddings are used to distinguish different statements, while different tokens are represented by different token embeddings. Because the Transformer model is position independent when encoding sentences, our model utilizes position embeddings to obtain position information. After the embedding process, $E=(E_{1},E_{2},\ldots,E_{i},\ldots,E_{N})$ replaces the token sequence $S$ as the input of our model. The Transformer model employs multi-head attention and self-attention to encode the embedded tokens $E$ into hidden states $H=(H_{1},H_{2},\ldots,H_{i},\ldots,H_{N})$. In the self-attention mechanism, each token is transformed into a query, a key and a value. The output for each query is calculated as an attention-weighted sum of the values, where the dot product of the query with each key determines the weight on the corresponding value. In practice, the queries, keys and values are packed together into matrices $Q$, $K$ and $V$. The attention output is calculated as:

$\displaystyle\textit{Attention}(Q,K,V)=\textit{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$ (1)

where $d_{k}$ denotes the dimension of queries and keys.
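As an illustration of Eq. (1), the following minimal sketch computes scaled dot-product attention on plain Python lists. The toy matrices `Q`, `K` and `V` and the helper names are illustrative assumptions and are not taken from the Topic-BERT implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


# Toy example with two tokens and d_k = 2 (values are made up).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`; the first query matches the first key best, so its output lies closer to the first value vector.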

Multi-head attention allows the model to capture different relationships in a sentence at different positions, which can increase the diversity of attention. All heads are connected and the final hidden state is determined by:

$\displaystyle\textit{MultiHead}(Q,K,V)=\textit{Concat}(\textit{head}_{1},\ldots,\textit{head}_{h})W^{O},\quad\textit{head}_{i}=\textit{Attention}(QW^{Q}_{i},KW^{K}_{i},VW^{V}_{i}),$ (2)

where Concat denotes the concatenation function, and $W^{Q}_{i}\in\mathbb{R}^{d_{\textit{model}}\times d_{k}}$, $W^{K}_{i}\in\mathbb{R}^{d_{\textit{model}}\times d_{k}}$, $W^{V}_{i}\in\mathbb{R}^{d_{\textit{model}}\times d_{v}}$ and $W^{O}\in\mathbb{R}^{hd_{v}\times d_{\textit{model}}}$ represent parameter matrices. Following BERT, we set the number of attention heads $h=12$ and $d_{k}=d_{v}=d_{\textit{model}}/h=64$. Since the dimension of each head is reduced, the total computational cost is similar to that of single-head attention in the full dimension.

In the original BERT model, [CLS] is added to the word sequence as the first token, and its hidden state is employed as the input vector representation of the classification task. In contrast, our model makes full use of each hidden state to obtain semantic vectors, and merges the original state into the final classification input vector representation. To achieve this, Topic-BERT introduces additional information to extend the entire calculation process.

3.2 Additional information

Harmful short texts tend to be misleading and confusing, so the identification of harmful short texts is more difficult than other classification tasks [25]. Meanwhile, the authenticity of harmful short texts tends to be more susceptible to other factors. Intuitively, what reliable people say is often true, and people who behave badly are more likely to spread harmful short texts. Therefore, in addition to the short text itself, other relevant additional information can also provide strong support for classifying harmful short texts. For short texts, author information and all relevant content can be part of the additional information. As an effective method of mining document structure and topic information from short texts, topic models have been widely used in the field of short text classification. The topics of short texts can also be used as additional information. In this paper, we utilize GPU-DMM to obtain topic information of short texts, which improves short text topic modeling through the semantic information provided by word embeddings.

Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM leverages the Generalized Pólya Urn (GPU) model to merge word correlations learned from external text corpora. Specifically, after sampling the topic of short texts, GPU-DMM increases the probability of semantically related words under the same topic. Although the number of co-occurrences in short text collections currently being modeled is very low, GPU-DMM links semantically related words together to improve topic coherence. In the GPU-DMM model, the cosine similarity between word embeddings of two words is used to measure the semantic relevance between them. Furthermore, GPU-DMM also introduces a filtering strategy to assist the process of topic inference, so that external knowledge is only used for specific topics. Since GPU-DMM utilizes word embeddings learned from external text collections, the model can flexibly receive word embeddings learned from any other text corpus, thereby applying the model to different fields.
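The relatedness measure described above can be sketched as follows. This is a minimal illustration, assuming toy two-dimensional embeddings and a hypothetical similarity threshold (`threshold=0.7`); neither the values nor the function names come from the GPU-DMM implementation, and the promotion and filtering steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_words(word, embeddings, threshold):
    """Words whose embedding similarity to `word` reaches the threshold.

    A word without an embedding is treated as having no related words.
    """
    if word not in embeddings:
        return []
    target = embeddings[word]
    return [w for w, vec in embeddings.items()
            if w != word and cosine(target, vec) >= threshold]


# Toy embeddings (made-up values for illustration only).
embeddings = {
    "discount": [0.9, 0.1],
    "sale": [0.8, 0.2],
    "weather": [0.0, 1.0],
}
related = related_words("discount", embeddings, threshold=0.7)
```

In this toy vocabulary, only "sale" is close enough to "discount" to be treated as semantically related.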

In the Topic-BERT model, the additional information is divided into two parts. To alleviate the sparseness of short texts, additional information can be used to help short texts enrich their semantics. Therefore, the first part of additional information is added to the head of the sentence, resulting in a longer and more complete short text representation. The Transformer structure will extract the connection between text information and additional information, making the model more robust. The second part is the hidden topic of short texts, which is used to calculate attention weights. The weight determines the usage of all hidden states in the output representation. The stronger the token is associated with the topic of a short text, the higher the attention weight. As a result, the attention can be measured by the relationship between the additional information representation $EI$ and the hidden state $H_{i}$. We define the attention weight $A_{i}$ of each hidden state as:

$\displaystyle A_{i}=\frac{\exp(\textit{score}(H_{i},EI))}{\sum_{j}\exp(\textit{score}(H_{j},EI))},$ (3)

where score denotes the score function, which can be defined as:

$\displaystyle\textit{score}(H_{i},EI)=\textit{tanh}(W^{H}[H_{i}:EI]+b),$ (4)

where tanh represents the non-linear activation function, $W^{H}$ represents the intermediate parameter matrix, and $b$ is a bias term. After the attention weight $A$ is calculated, the semantic representation $R^{*}$ can be obtained by the weighted sum of all hidden states $H$ , as shown in Eq. (5):

$\displaystyle R^{*}=\sum_{j=1}^{N}A_{j}H_{j}.$ (5)

Finally, Topic-BERT combines the original short text representation $R$ (black part in Fig. 1) and the semantic representation $R^{*}$ (gray part in Fig. 1) to form a new classification vector $RC$, as follows:

$\displaystyle RC=\textit{Concat}(R,R^{*}).$ (6)
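Equations (3) to (6) can be traced step by step with a small sketch. The hidden states, topic representation and parameters below are made-up toy values. Treating $W^{H}$ as a single weight vector (so that the score is a scalar) and using the first hidden state as the original representation $R$ are assumptions made only for this illustration.

```python
import math


def topic_attention(H, EI, W, b):
    """Return attention weights A (Eq. 3) and weighted sum R* (Eq. 5)."""
    # Eq. (4): score(H_i, EI) = tanh(W [H_i : EI] + b), a scalar per token.
    scores = [math.tanh(sum(w * x for w, x in zip(W, h + EI)) + b) for h in H]
    # Eq. (3): softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    A = [e / total for e in exps]
    # Eq. (5): weighted sum of all hidden states.
    dim = len(H[0])
    R_star = [sum(a * h[d] for a, h in zip(A, H)) for d in range(dim)]
    return A, R_star


H = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]  # toy hidden states, N = 3
EI = [1.0, 0.0]                            # toy topic representation
W = [0.5, -0.3, 0.8, 0.1]                  # length = len(H_i) + len(EI)
b = 0.0

A, R_star = topic_attention(H, EI, W, b)
R = H[0]            # assumed stand-in for the original representation R
RC = R + R_star     # Eq. (6): list concatenation plays the role of Concat
```

The classification vector `RC` has twice the hidden dimension, since it concatenates the original representation with the topic-weighted one.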

3.3 Two stages of the model

In common text classification tasks, each category is independent and easy to distinguish. However, for harmful short text identification, it is sometimes difficult to define the categories. Because many harmful short texts are intentionally created to confuse and mislead readers, a harmful short text may contain both real and harmful text information. Therefore, in most cases, harmful short text identification cannot be simply regarded as a binary classification task, because it also involves a variety of categories such as advertising, violence, pornography and sensitive content.

It can be seen that “true” or “false” cannot cover all cases of harmful short texts, and more tags should be used for classification, which is more suitable for actual situations. Therefore, we need to make corresponding changes to the binary classification model. Our design is inspired by aspect-based sentiment analysis [26], which divides the classification task into two stages: aspect and sentiment. There is a specific connection between sentiment and aspect, and the classification result of aspect has an impact on sentiment classification. Similarly, Topic-BERT subdivides fine-grained labels into original and auxiliary categories, and exploits a two-stage model to distinguish coarse-grained and fine-grained labels.

The two stages of Topic-BERT have a similar structure, as shown in Fig. 1. The proposed model first divides all short texts into two categories, namely coarse-grained labels. Secondly, the different situations of the dataset determine the number of fine-grained labels. The first stage model uses the above methods to learn coarse-grained labels of short text. After this, Topic-BERT adds coarse-grained labels to the second stage model as part of the input. Following a similar process, the second stage model will encode inputs and coarse-grained labels, and then learn to classify fine-grained labels. Fine-grained and coarse-grained labels together determine the final labels of short texts.
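The data flow between the two stages can be sketched as below. The two classifiers are keyword-based placeholders standing in for the fine-tuned sub-models, and prepending the coarse-grained label as a token is only one possible way of adding it to the second-stage input; both are assumptions for illustration.

```python
def two_stage_predict(tokens, stage_one, stage_two):
    """Run the coarse-grained stage, then feed its label to the second stage."""
    coarse = stage_one(tokens)
    # The coarse-grained label becomes part of the second-stage input.
    fine = stage_two([coarse] + tokens)
    return coarse, fine


def toy_stage_one(tokens):
    """Placeholder coarse-grained classifier (harmful vs. normal)."""
    return "[HARMFUL]" if "buy" in tokens else "[NORMAL]"


def toy_stage_two(tokens):
    """Placeholder fine-grained classifier that reads the coarse label first."""
    if tokens[0] == "[NORMAL]":
        return "normal"
    return "advertising" if "buy" in tokens else "sensitive"


prediction = two_stage_predict(["buy", "now", "cheap"], toy_stage_one, toy_stage_two)
```

Here `prediction` holds both labels, mirroring the statement that fine-grained and coarse-grained labels together determine the final label.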

The cross entropy loss function is used to train the model. The loss function is defined as shown in Eq. (7):

$\displaystyle\mathcal{L}=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{ij}\log p_{ij},$ (7)

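where $m$ denotes the number of labels, $n$ denotes the number of training samples, $p_{ij}$ is the predicted probability that sample $i$ belongs to label $j$, and $y_{ij}$ is the ground-truth indicator, which equals 1 if sample $i$ has label $j$ and 0 otherwise.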
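A minimal sketch of Eq. (7) follows, assuming one-hot ground-truth rows and predicted probability rows. The small clipping constant `eps` is an added safeguard against `log(0)` and is not part of the equation.

```python
import math


def cross_entropy(P, Y, eps=1e-12):
    """Average cross entropy over n samples and m labels (Eq. 7)."""
    n = len(P)
    total = 0.0
    for p_row, y_row in zip(P, Y):
        for p, y in zip(p_row, y_row):
            if y > 0:
                # Clip p away from zero so the logarithm is defined.
                total += y * math.log(max(p, eps))
    return -total / n


# Two samples, two labels (toy probabilities).
P = [[0.9, 0.1], [0.2, 0.8]]
Y = [[1, 0], [0, 1]]
loss = cross_entropy(P, Y)
```

For this toy input the loss is the mean of $-\log 0.9$ and $-\log 0.8$.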

4. Experiments

This section first introduces the collection process of the short text corpus. Secondly, we briefly analyze the characteristics of the corpus and describe the settings. Finally, we verify the effectiveness of the proposed model by comparing it with baseline methods on multiple evaluation indicators.

4.1 Datasets

We use the open platform of microblog (https://open.weibo.com/) to crawl 82156 short texts. Some short texts contain only emoticons and punctuation and therefore no meaningful text. We remove these invalid short texts and finally obtain 74973 valid short texts. Five graduate students in natural language processing were invited to annotate the dataset. After statistical analysis, we choose 3270 harmful short texts, including sensitive short texts, pornographic short texts, advertising short texts and spam short texts. Furthermore, in addition to collecting text information, we also collect the number of likes, retweets, comments, author’s followers and posts as additional information for training Topic-BERT. We randomly select 30% of short texts from the entire dataset as the testing set.
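The random 30% hold-out can be sketched as follows. The fixed seed and the rounding of the test size are assumptions for reproducibility of the example; the paper does not state how the split was implemented.

```python
import random


def train_test_split(items, test_ratio=0.3, seed=0):
    """Shuffle the items and hold out a fraction as the testing set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy corpus of ten document ids.
train_ids, test_ids = train_test_split(range(10))
```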

Figure 2.

The text length distribution of the microblog dataset.

Figure 2 shows the text length distribution of the microblog dataset. As shown in the figure, the majority of short texts are less than 100 words in length, and over 94% of them are less than 150 words. The microblog dataset therefore consists of typical short texts. To better train Topic-BERT, we performed the following pre-processing steps on the dataset: (1) segment the sentence and remove stop words (https://github.com/fxsjy/jieba); (2) remove the extremely short texts containing only one word; (3) replace all URL information with “#URL#” and all account information with “#ACC#”. This makes it convenient to extract the features of accounts and URLs and also reduces the feature dimension.
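Steps (2) and (3) of the pre-processing can be sketched with regular expressions. The patterns below (an `http(s)://` prefix for URLs and an `@` prefix for accounts) are assumptions about how URLs and account mentions appear in the corpus; word segmentation and stop word removal are not shown.

```python
import re

# Assumed patterns; the exact rules used for the dataset are not specified.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9./?=&_%~:#-]+")
ACCOUNT_PATTERN = re.compile(r"@[\w\-]+")


def normalize(text):
    """Replace URLs with #URL# and account mentions with #ACC#."""
    text = URL_PATTERN.sub("#URL#", text)
    text = ACCOUNT_PATTERN.sub("#ACC#", text)
    return text


def is_too_short(tokens):
    """Step (2): texts with at most one word are discarded."""
    return len(tokens) <= 1


cleaned = normalize("see http://t.cn/abc from @user_1 now")
```

Mapping every URL and every account to a single placeholder keeps the signal that a URL or mention is present while collapsing many distinct strings into two tokens.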

4.2 Settings

4.2.1 GPU-DMM

Unlike lengthy documents such as blogs and news, short texts are noisy and sparse [27, 28]. To improve the topic quality of short texts, we utilize the GPU-DMM topic model, which is more suitable for short texts, to extract topic features. For the word embeddings used in the model, we employ 100-dimensional Chinese word embeddings trained by Li et al. [9] using Skip-Gram [29] on 7 million Chinese documents collected from the Baike website (http://baike.baidu.com/). If there is no embedding for a word, we consider that the word has no semantically related words.

To verify the difference between harmful short texts and normal short texts in document-topic distribution, we compare the GPU-DMM topic model with the Biterm Topic Model (BTM) and DMM.

• BTM improves the sparsity of the LDA model for short text topic extraction. Its main idea is to model the words that co-occur in the same context, called biterms, without considering the order of words [30].

• DMM proposes an assumption that each document has only one topic. Since the context of a short text is very limited, this assumption is reasonable for short texts [31].

For all topic models in comparison, we set the Dirichlet hyperparameters $\alpha=50/K$ , $\beta=0.01$ , where $K$ denotes the number of topics. The number of Gibbs sampling iterations is 1000. In GPU-DMM, we set the amount of promotion $\mu=0.2$ .
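The hyperparameter settings above can be collected in one place. The sketch below only restates the values from the text; the dictionary keys are hypothetical names, and $K=60$ is taken from the topic distribution experiment in Section 4.3.1.

```python
def topic_model_config(num_topics):
    """Hyperparameters shared by the compared topic models."""
    return {
        "K": num_topics,
        "alpha": 50.0 / num_topics,   # Dirichlet prior on topics
        "beta": 0.01,                 # Dirichlet prior on words
        "gibbs_iterations": 1000,
        "gpu_dmm_mu": 0.2,            # amount of promotion, GPU-DMM only
    }


config = topic_model_config(60)
```

With 60 topics, $\alpha=50/60\approx 0.83$.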

Figure 3.

Topic distribution of BTM.

Figure 4.

Topic distribution of DMM.

4.2.2 Topic-BERT

Topic-BERT is based on pre-trained $\textit{BERT}_{\textit{BASE}}$ with approximately 110 million parameters, which is more suitable for the proposed model than $\textit{BERT}_{\textit{LARGE}}$, which has about 340 million parameters and requires more computing resources. In the experiments, the number of attention heads and the number of Transformer encoder layers are both set to 12. The hidden size $d_{\textit{model}}$, which is the dimension of token vectors, is set to 768. Devlin et al. pointed out that gelu [32] activation is more suitable for BERT than other activations such as sigmoid, tanh or relu [2]. Therefore, we employ gelu activation in our model. To prevent overfitting, dropout layers are added to Topic-BERT. The probability of attention dropout and hidden dropout is set to 0.1. Adam is used as the optimizer with a learning rate of 5e-5.

For additional information, the numbers of likes, comments, retweets, author’s followers and posts are each discretized into 20 categories, and the resulting categories are used directly as tokens in the input sequence. The document-topic distribution of the short text is used to calculate the attention weight of the proposed model. To verify the validity of our method, the Topic-BERT model is compared with the following methods:

Figure 5.

Topic distribution of GPU-DMM.

Figure 6.

Classification of harmful short texts.

• Support Vector Machine (SVM) is a generalized linear classifier that classifies data in a supervised learning manner. Short texts are represented as bag-of-words, and they are classified using SVM.

• Random Forests (RF) is a classifier that combines multiple tree predictors. This baseline is similar to SVM except that the classifier is replaced by RF.

• BERT is the basic model of the proposed model, and can also be directly used to identify harmful short texts.
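The discretization of the additional counts (likes, comments, retweets, followers and posts) into 20 categories, mentioned earlier in this section, can be sketched as follows. The discretization method is not specified in the text, so the logarithmic binning and the token format `[LIKES_7]` are assumptions for illustration only.

```python
import math

NUM_BINS = 20


def to_bin(count, max_count):
    """Map a non-negative count to one of NUM_BINS categories on a log scale.

    `max_count` must be positive; larger counts fall into the last bin.
    """
    if count <= 0:
        return 0
    scaled = math.log1p(count) / math.log1p(max_count)
    return min(NUM_BINS - 1, int(scaled * NUM_BINS))


def count_token(name, count, max_count):
    """Build a hypothetical input token such as [LIKES_7]."""
    return f"[{name}_{to_bin(count, max_count)}]"


bins = [to_bin(c, 1000) for c in (0, 1, 10, 100, 1000)]
token = count_token("LIKES", 0, 1000)
```

A logarithmic scale is one plausible choice because interaction counts on social media are heavily skewed; equal-width or quantile binning would fit the same interface.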

4.3 Experimental results

4.3.1 Topic distribution

To verify the difference between the topic distributions of harmful short texts and normal short texts, we first employ DMM, BTM and GPU-DMM to extract 60 topics from the training set. Then, we record the proportion of short texts belonging to each topic. Since DMM and GPU-DMM follow the assumption that each short text has only one topic, for comparison, BTM only uses the topic with the highest probability in the topic distribution of each short text.
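The per-topic proportions described above can be computed as in the following sketch, which assigns each short text its highest-probability topic and then counts assignments. The toy distributions are made up, and the function names are illustrative.

```python
from collections import Counter


def dominant_topic(distribution):
    """Index of the topic with the highest probability."""
    return max(range(len(distribution)), key=distribution.__getitem__)


def topic_proportions(assignments, num_topics):
    """Fraction of short texts assigned to each topic."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(k, 0) / total for k in range(num_topics)]


# Toy document-topic distributions over three topics.
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
assignments = [dominant_topic(d) for d in doc_topics]
proportions = topic_proportions(assignments, num_topics=3)
```

Computing these proportions separately for harmful and normal short texts yields the two distributions compared in the figures.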

Figures 3–5 illustrate the topic distributions of BTM, DMM and GPU-DMM, respectively. Red columns indicate the five topics with the highest number of short texts. From the figures, we can see that GPU-DMM achieves the highest discrimination between the topic distributions of the two types of short texts, which verifies the effectiveness of GPU-DMM in harmful short text identification. The topics extracted by GPU-DMM are more discriminative and representative. The topic distribution of BTM is not highly discriminative, which is not helpful for harmful short text filtering systems. This may be because BTM adds only a limited amount of word co-occurrence information and does not fundamentally solve the sparsity problem, as discussed in [33], so the extracted topics are harder to distinguish. Furthermore, DMM is better than BTM in topic differentiation, which indicates that the assumption that each text has only one topic is applicable to short texts.

4.3.2 Harmful short text identification

We employ SVM, RF, BERT and Topic-BERT for harmful short text identification, and Precision (P), Recall (R) and F1-measure (F1) are used to measure performance.

$\displaystyle P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN},\quad F1=\frac{2\times P\times R}{P+R},$ (8)

where $TP$ is the number of harmful short texts correctly classified as harmful, $FP$ is the number of normal short texts classified as harmful, and $FN$ is the number of harmful short texts classified as normal.
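Equation (8) can be checked on a small example. In the sketch below, label 1 denotes a harmful short text and label 0 a normal one; this encoding and the toy predictions are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute P, R and F1 for the positive (harmful) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
```

Here two harmful texts are found, one is missed and one normal text is flagged, so $TP=2$, $FP=1$ and $FN=1$, giving $P=R=F1=2/3$.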

In this section, we first compare the performance of four models for binary classification of harmful short texts. The proposed model directly uses coarse-grained labels to identify harmful short texts.

Figure 6a shows the precision, recall and F1-measure of four models on the binary classification task. From the figure, we observe that the performance of BERT is better than SVM and RF. This may be because BERT pre-trains a large network through massive external text corpus, and then uses the network to extract short text features and classify harmful short texts. This mechanism can effectively improve the performance of text classification tasks, which is consistent with the findings of [2]. The other observation is that Topic-BERT achieves the best performance compared with three baseline methods. The reason is that our model utilizes additional information such as likes, comments, retweets and topics to identify harmful short texts, which can alleviate the sparsity problem.

Next, we compare the performance of four methods for multiple classification of harmful short texts. In the Topic-BERT model, fine-grained labels are used to identify harmful short texts.

Figure 6b shows the precision, recall and F1-measure of four models on the multiple classification task. The experimental results show that all four models perform worse than on the binary classification task. This is because the multiple classification of harmful short texts is more difficult. The proposed model achieves the best results on all three evaluation indicators. The experimental results suggest that the two stages of Topic-BERT can promote each other and improve the performance effectively.

4.3.3 Ablation study

The experimental results of the ablation study are shown in Fig. 7, which reveals the factors that may affect the results. “One-stage” represents using only the first stage of Topic-BERT to classify harmful short texts, while still using additional information and the attention mechanism. It can be considered as the upper part of Fig. 1, with short texts directly classified into fine-grained labels. As shown in Fig. 7, the precision, recall and F1 measure of the one-stage model are all the worst, and its precision is 9.56% lower than that of the complete two-stage model. This means that the two-stage model can extract more semantic features, which is helpful for fine-grained multiple-classification.

Figure 7.

Ablation study.

For the two-stage model, we ablate some components of Topic-BERT to evaluate their effects. “Only text” denotes that only short texts are used and directly input into the two-stage model without using the attention mechanism and additional information. Unsurprisingly, its performance is worse than the other two-stage models, but still better than the one-stage model. “$+$Information” refers to using only additional information such as the number of likes, retweets, comments, author’s followers and posts, and combining short texts with this additional information as input to the model. “$+$Attention” indicates that only the topic information is used, and each hidden state in Topic-BERT is combined with the topic distribution to obtain the attention weight. As shown in Fig. 7, each component achieves a 5–6% improvement over the “Only text” model. The experimental results verify that the attention mechanism and additional information are helpful for semantic feature extraction and harmful short text identification.

5. Conclusion

In this paper, we propose a two-stage model based on BERT for harmful short text identification, namely Topic-BERT. Topic-BERT first leverages part of additional information as inputs to solve the sparsity problem of short texts. Topics learned by GPU-DMM are used as another part of additional information to calculate the weight of the attention mechanism. Furthermore, the proposed model constructs two similar sub-models to distinguish labels of different granularities. Coarse-grained labels and fine-grained labels can promote each other and improve the performance effectively. As there are more multiple-classification tasks in real life, this modeling approach will have greater application prospects. The experimental results indicate the effectiveness of the proposed model compared with existing state-of-the-art methods. In the future, we will study how to extract image information to improve our model.

Footnotes

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work was funded by the Jianghan University Doctoral Research Startup Fund Project (No. 1028/06060001).

¹ https://open.weibo.com/.

³ http://baike.baidu.com/.